Exploratory Data Analysis

In [1]:
import os
import networkx as nx
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
import numpy as np
import geopandas as gpd
import shapely
from shapely.geometry import Point
import math
import time
from matplotlib_scalebar.scalebar import ScaleBar
from mpl_toolkits.axes_grid1 import make_axes_locatable
# Configure Notebook


import warnings 
warnings.filterwarnings('ignore')
In [222]:
import matplotlib as mpl
plt.rcParams['figure.dpi'] = 450 # increase resolution

# Configure Notebook
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Reading Data

Import Data

First, import the previously exported raw data file

In [117]:
trips_data = pd.read_csv('trips_raw_data.csv')
trips_data.head()
Out[117]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
0 58 712441 NaN 274 7006.0 2017-01-01 00:03:00-05:00 Bay St / College St (East Side) 7021.0 2017-01-01 00:08:00-05:00 Bay St / Albert St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
1 59 712442 NaN 538 7046.0 2017-01-01 00:03:00-05:00 Niagara St / Richmond St W 7147.0 2017-01-01 00:12:00-05:00 King St W / Fraser Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
2 60 712443 NaN 992 7048.0 2017-01-01 00:05:00-05:00 Front St W / Yonge St (Hockey Hall of Fame) 7089.0 2017-01-01 00:22:00-05:00 Church St / Wood St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
3 61 712444 NaN 1005 7177.0 2017-01-01 00:09:00-05:00 East Liberty St / Pirandello St 7202.0 2017-01-01 00:26:00-05:00 Queen St W / York St (City Hall) ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
4 62 712445 NaN 645 7203.0 2017-01-01 00:14:00-05:00 Bathurst St/Queens Quay(Billy Bishop Airport) 7010.0 2017-01-01 00:25:00-05:00 King St W / Spadina Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN

5 rows × 24 columns

In [118]:
trips_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8039088 entries, 0 to 8039087
Data columns (total 24 columns):
 #   Column               Dtype  
---  ------               -----  
 0   Unnamed: 0           int64  
 1   Trip Id              int64  
 2   Subscription Id      float64
 3   Trip Duration        int64  
 4   Start Station Id     float64
 5   Start Time           object 
 6   Start Station Name   object 
 7   End Station Id       float64
 8   End Time             object 
 9   End Station Name     object 
 10  Bike Id              float64
 11  User Type            object 
 12  merge_time           object 
 13  Date/Time            object 
 14  Temp (°C)            float64
 15  Dew Point Temp (°C)  float64
 16  Rel Hum (%)          float64
 17  Wind Dir (10s deg)   float64
 18  Wind Spd (km/h)      float64
 19  Visibility (km)      float64
 20  Stn Press (kPa)      float64
 21  Hmdx                 float64
 22  Wind Chill           float64
 23  Weather              object 
dtypes: float64(13), int64(3), object(8)
memory usage: 1.4+ GB

Setting Datetime

Converting Start Time and End Time to Datetime instead of object

In [119]:
trips_data['Start Time'] = pd.DatetimeIndex(trips_data['Start Time']).tz_convert(tz = 'EST')
trips_data['End Time'] = pd.DatetimeIndex(trips_data['End Time']).tz_convert(tz = 'EST')
In [120]:
trips_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8039088 entries, 0 to 8039087
Data columns (total 24 columns):
 #   Column               Dtype              
---  ------               -----              
 0   Unnamed: 0           int64              
 1   Trip Id              int64              
 2   Subscription Id      float64            
 3   Trip Duration        int64              
 4   Start Station Id     float64            
 5   Start Time           datetime64[ns, EST]
 6   Start Station Name   object             
 7   End Station Id       float64            
 8   End Time             datetime64[ns, EST]
 9   End Station Name     object             
 10  Bike Id              float64            
 11  User Type            object             
 12  merge_time           object             
 13  Date/Time            object             
 14  Temp (°C)            float64            
 15  Dew Point Temp (°C)  float64            
 16  Rel Hum (%)          float64            
 17  Wind Dir (10s deg)   float64            
 18  Wind Spd (km/h)      float64            
 19  Visibility (km)      float64            
 20  Stn Press (kPa)      float64            
 21  Hmdx                 float64            
 22  Wind Chill           float64            
 23  Weather              object             
dtypes: datetime64[ns, EST](2), float64(13), int64(3), object(6)
memory usage: 1.4+ GB
In [121]:
trips_data.head()
Out[121]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
0 58 712441 NaN 274 7006.0 2017-01-01 00:03:00-05:00 Bay St / College St (East Side) 7021.0 2017-01-01 00:08:00-05:00 Bay St / Albert St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
1 59 712442 NaN 538 7046.0 2017-01-01 00:03:00-05:00 Niagara St / Richmond St W 7147.0 2017-01-01 00:12:00-05:00 King St W / Fraser Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
2 60 712443 NaN 992 7048.0 2017-01-01 00:05:00-05:00 Front St W / Yonge St (Hockey Hall of Fame) 7089.0 2017-01-01 00:22:00-05:00 Church St / Wood St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
3 61 712444 NaN 1005 7177.0 2017-01-01 00:09:00-05:00 East Liberty St / Pirandello St 7202.0 2017-01-01 00:26:00-05:00 Queen St W / York St (City Hall) ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
4 62 712445 NaN 645 7203.0 2017-01-01 00:14:00-05:00 Bathurst St/Queens Quay(Billy Bishop Airport) 7010.0 2017-01-01 00:25:00-05:00 King St W / Spadina Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN

5 rows × 24 columns

Since user data before 2018 is incomplete, trips with a Start Time before 2018 are excluded

In [122]:
trips_data = trips_data[trips_data['Start Time']>'2018-01-01']
trips_data.head()
Out[122]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
1392031 1392089 2383648 NaN 393 7018.0 2018-01-01 00:47:00-05:00 Bremner Blvd / Rees St 7176.0 2018-01-01 00:54:00-05:00 Bathurst St / Fort York Blvd ... -16.8 -21.1 70.0 NaN 4.0 16.1 102.1 NaN -20.0 NaN
1392032 1392090 2383649 NaN 625 7184.0 2018-01-01 00:52:00-05:00 Ossington Ave / College St 7191.0 2018-01-01 01:03:00-05:00 Central Tech (Harbord St) ... -16.8 -21.1 70.0 NaN 4.0 16.1 102.1 NaN -20.0 NaN
1392033 1392091 2383650 NaN 233 7235.0 2018-01-01 00:55:00-05:00 Bay St / College St (West Side) - SMART 7021.0 2018-01-01 00:59:00-05:00 Bay St / Albert St ... -16.8 -21.1 70.0 NaN 4.0 16.1 102.1 NaN -20.0 NaN
1392034 1392092 2383651 NaN 1138 7202.0 2018-01-01 00:57:00-05:00 Queen St W / York St (City Hall) 7020.0 2018-01-01 01:16:00-05:00 Phoebe St / Spadina Ave ... -16.8 -21.1 70.0 NaN 4.0 16.1 102.1 NaN -20.0 NaN
1392035 1392093 2383652 NaN 703 7004.0 2018-01-01 01:00:00-05:00 University Ave / Elm St 7060.0 2018-01-01 01:12:00-05:00 Princess St / Adelaide St E ... -16.8 -21.1 70.0 NaN 4.0 16.1 102.1 NaN -20.0 NaN

5 rows × 24 columns
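As an aside, the filter above compares a timezone-aware column against a plain string and relies on pandas coercing it. A minimal sketch (with a hypothetical miniature frame standing in for trips_data) of the more explicit form, using a tz-aware cutoff:

```python
import pandas as pd

# Hypothetical miniature frame standing in for trips_data (column name matches).
df = pd.DataFrame({
    'Start Time': pd.to_datetime(
        ['2017-06-01 08:00', '2018-03-15 09:00', '2019-07-04 10:00']
    ).tz_localize('EST')
})

# Comparing against an explicit tz-aware cutoff avoids relying on
# pandas' implicit string-to-timestamp coercion.
cutoff = pd.Timestamp('2018-01-01', tz='EST')
filtered = df[df['Start Time'] >= cutoff]
print(len(filtered))  # rows from 2018 onward
```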

User Analysis

General Analysis without Weather factors

In this section, casual and annual members are compared regardless of weather conditions to provide a general overview of behavioural differences, using a daily frequency

In [123]:
trips_data_day = trips_data
trips_data_day=trips_data_day.groupby([pd.Grouper(key='Start Time', freq='D')]).agg(
    rides=pd.NamedAgg(column='User Type', aggfunc=lambda x: (x.count())),
    annual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Annual Member'].count())),
    casual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Casual Member'].count())))

trips_data_day['rides'] = trips_data_day['rides'].astype('Int64')
trips_data_day['annual_members'] = trips_data_day['annual_members'].astype('Int64')
trips_data_day['casual_members'] = trips_data_day['casual_members'].astype('Int64')
trips_data_day.head()
Out[123]:
rides annual_members casual_members
Start Time
2018-01-01 00:00:00-05:00 243 235 8
2018-01-02 00:00:00-05:00 953 934 19
2018-01-03 00:00:00-05:00 1181 1166 15
2018-01-04 00:00:00-05:00 1169 1153 16
2018-01-05 00:00:00-05:00 783 773 10

Adding workday indicator

In [124]:
trips_data_day['workday'] = np.where((trips_data_day.index.dayofweek) < 5,'True','False')
trips_data_day = trips_data_day.dropna()
trips_data_day.head()
Out[124]:
rides annual_members casual_members workday
Start Time
2018-01-01 00:00:00-05:00 243 235 8 True
2018-01-02 00:00:00-05:00 953 934 19 True
2018-01-03 00:00:00-05:00 1181 1166 15 True
2018-01-04 00:00:00-05:00 1169 1153 16 True
2018-01-05 00:00:00-05:00 783 773 10 True
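The workday flag above is stored as the strings 'True'/'False'. A small sketch (on a toy daily index, not the real data) of the same indicator as a genuine boolean, which filters and aggregates more naturally and still works as a seaborn hue:

```python
import pandas as pd

# Toy daily index; the real notebook uses the grouped Start Time index.
idx = pd.date_range('2018-01-01', periods=7, freq='D', tz='EST')
df = pd.DataFrame({'rides': range(7)}, index=idx)

# Storing workday as a real boolean (rather than the strings 'True'/'False')
# keeps filtering and aggregation simpler.
df['workday'] = df.index.dayofweek < 5
print(df['workday'].sum())  # number of weekdays in the window
```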

Plotting graphs to compare workday vs non-workday behaviour

First, a KDE plot with a hue semantic to show the distribution

In [125]:
import matplotlib.patches as  mpatches
plt.figure(figsize=(7, 5))
trips_data_day['casual_members']=trips_data_day['casual_members'].astype('float64')
trips_data_day['annual_members']=trips_data_day['annual_members'].astype('float64')
plt.title('Comparison of Casual Members and Annual Members \n on Working and Non-working days', fontsize=18)
ax=sns.kdeplot(data=trips_data_day, x="casual_members", y="annual_members", hue="workday")
ax.set_xlabel('Casual Membership',fontsize=12)
ax.set_ylabel('Annual Membership',fontsize=12)
plt.show()

Combining with a scatter plot to show the distribution

In [126]:
plt.figure(figsize=(7, 5))
plt.title('Comparison of Casual Members and Annual Members \n on Working and Non-working days', fontsize=18)
bx=sns.scatterplot(data=trips_data_day, x="casual_members", y="annual_members",hue="workday")
bx.legend(title='workday')
bx.set_xlabel('Casual Membership',fontsize=12)
bx.set_ylabel('Annual Membership',fontsize=12)
plt.show()

Adding violin plot to further illustrate the distribution for both annual and casual members

In [127]:
plt.figure(figsize=(8, 5))
plt.title('Comparison of Annual Members \n on Working and Non-working days', fontsize=18)
cx1 = sns.violinplot(x="workday", y="annual_members", data=trips_data_day)
cx1.set_xlabel('Workday',fontsize=16)
cx1.set_ylabel('Annual Rides per Day',fontsize=16)
plt.show()
In [128]:
plt.figure(figsize=(8, 5))
plt.title('Comparison of Casual Members \n on Working and Non-working days', fontsize=18)
cx2 = sns.violinplot(x="workday", y="casual_members", data=trips_data_day)
cx2.set_xlabel('Workday',fontsize=16)
cx2.set_ylabel('Casual Rides per Day',fontsize=16)
plt.show()

From the first two graphs above, it is evident that more annual members use the bike share service on weekdays, suggesting these users may be commuting to and from work. For casual members, by contrast, usage is more spread out and higher on weekends, suggesting leisure trips without a specific pattern.

This is further demonstrated by the violin plots: on weekdays, annual members show a more evenly distributed number of rides per day than casual members, suggesting casual members are less likely to use the service on weekdays. The probability of casual members riding increases on weekends, whereas annual members show much less variation between weekdays and weekends.

However, to further justify the assumption that annual members ride for commuting purposes, hours of use are examined below.
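The weekday/weekend contrast described above can also be quantified directly. A sketch with hypothetical daily counts (the real notebook would run this on trips_data_day):

```python
import pandas as pd

# Hypothetical daily counts mirroring the trips_data_day columns.
df = pd.DataFrame({
    'annual_members': [900, 950, 920, 880, 910, 400, 380],
    'casual_members': [100, 120, 110, 90, 105, 500, 520],
    'workday':        [True, True, True, True, True, False, False],
})

# Mean rides per day by user type, split on workday vs not.
summary = df.groupby('workday')[['annual_members', 'casual_members']].mean()
print(summary)
```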

Hourly Breakdown

To further examine the purposes of each member type, the start times are grouped together at hourly intervals

In [129]:
trips_data_hour = trips_data
trips_data_hour=trips_data_hour.groupby([pd.Grouper(key='Start Time', freq='h')]).agg(
    rides=pd.NamedAgg(column='User Type', aggfunc=lambda x: (x.count())),
    annual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Annual Member'].count())),
    casual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Casual Member'].count())))

trips_data_hour['rides'] = trips_data_hour['rides'].astype('Int64')
trips_data_hour['annual_members'] = trips_data_hour['annual_members'].astype('Int64')
trips_data_hour['casual_members'] = trips_data_hour['casual_members'].astype('Int64')
trips_data_hour.head()
Out[129]:
rides annual_members casual_members
Start Time
2018-01-01 00:00:00-05:00 4 4 0
2018-01-01 01:00:00-05:00 10 10 0
2018-01-01 02:00:00-05:00 6 5 1
2018-01-01 03:00:00-05:00 14 13 1
2018-01-01 04:00:00-05:00 6 6 0

Since the date does not matter for this comparison, observations are grouped by hour of day and the average number of trips taken by each user type is computed

In [130]:
trips_data_hour['datetime']=trips_data_hour.index
trips_data_hour['time']=trips_data_hour['datetime'].dt.hour
trips_data_hour = trips_data_hour.drop(columns=['datetime']).reset_index()
trips_data_hour = trips_data_hour.groupby(by='time').mean()
trips_data_hour.head()
Out[130]:
rides annual_members casual_members
time
0 70.648936 47.155706 23.49323
1 43.708738 28.5 15.208738
2 29.705594 19.308145 10.397448
3 15.173913 9.894843 5.27907
4 11.837972 8.857853 2.980119

Now plotting a distribution graph to show time of day and usage comparison

In [131]:
plt.figure(figsize=(7, 5))
plt.title('Average hourly rides per day', fontsize=18)
dx1=sns.lineplot(data=trips_data_hour, x='time', y="annual_members",label='Annual Members')
dx2=sns.lineplot(data=trips_data_hour, x='time', y="casual_members",label='Casual Members')
dx1.set_xlabel('Hour of the Day',fontsize=12)
dx1.set_ylabel('Average Rides per Hour',fontsize=12)
plt.show()

This graph clearly indicates two peak times for annual members, at Hour 8 and Hour 17, aligning with rush hours and matching the expectation that most annual members use the bike share service to commute to and from work.

On the contrary, casual members lack a significant peak hour aside from a gradual increase between Hour 12 and Hour 19, which suggests more evenly spread use, such as leisure or other purposes that are not time dependent.
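The peak hour can be read off programmatically rather than from the plot. A sketch on toy hourly averages (the real notebook would apply idxmax to trips_data_hour):

```python
import pandas as pd

# Toy hourly averages indexed by hour of day, shaped like trips_data_hour.
hourly = pd.DataFrame(
    {'annual_members': [5, 3, 2, 4, 20, 60, 80, 45, 30, 35, 50, 85, 70, 25,
                        10, 8, 6, 90, 75, 40, 20, 15, 10, 7]},
    index=pd.RangeIndex(24, name='time'),
)

# idxmax recovers the busiest hour directly instead of reading it off the plot.
peak_hour = hourly['annual_members'].idxmax()
print(peak_hour)
```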

General Analysis with Weather factors

To further examine ridership type behaviours, weather factors are included

Weather data

First, checking for null values in each column

In [132]:
trips_data.isnull().sum(axis=0).to_frame('count')
Out[132]:
count
Unnamed: 0 0
Trip Id 0
Subscription Id 1840078
Trip Duration 0
Start Station Id 0
Start Time 0
Start Station Name 0
End Station Id 0
End Time 0
End Station Name 0
Bike Id 1840078
User Type 0
merge_time 0
Date/Time 0
Temp (°C) 35653
Dew Point Temp (°C) 42397
Rel Hum (%) 41557
Wind Dir (10s deg) 372776
Wind Spd (km/h) 16264
Visibility (km) 18401
Stn Press (kPa) 35739
Hmdx 4491474
Wind Chill 6096338
Weather 6071636
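Note that the Weather column is null for the large majority of rows (roughly nine in ten), which motivates treating missing values as clear conditions below. A sketch of the fraction-missing computation on a toy column:

```python
import pandas as pd
import numpy as np

# Toy column where most entries are NaN, as with trips_data['Weather'].
weather = pd.Series(['Rain', np.nan, np.nan, 'Snow',
                     np.nan, np.nan, np.nan, np.nan])

# Fraction of missing entries; a high fraction is consistent with
# "no weather reading recorded means clear".
null_frac = weather.isna().mean()
print(round(null_frac, 2))
```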

Clear vs Precipitation

In [133]:
trips_data.groupby('Weather')['Trip Id'].count().sort_values(ascending=False)
Out[133]:
Weather
Rain                               204564
Fog                                134632
Snow                               107377
Rain,Fog                            56396
Haze                                31472
Thunderstorms,Rain                   7768
Moderate Rain                        7247
Moderate Rain,Fog                    5962
Thunderstorms,Heavy Rain,Fog         4110
Thunderstorms                        4095
Thunderstorms,Moderate Rain          3399
Rain,Snow                            2503
Heavy Rain,Fog                       1752
Thunderstorms,Fog                    1403
Freezing Rain,Fog                    1064
Moderate Snow                         398
Thunderstorms,Heavy Rain              324
Freezing Rain,Snow                    251
Heavy Snow                            203
Freezing Rain                         196
Snow,Blowing Snow                     109
Thunderstorms,Moderate Rain,Fog        88
Haze,Blowing Snow                      82
Thunderstorms,Rain,Fog                 26
Name: Trip Id, dtype: int64

Since no 'Clear' category appears in the counts above, the assumption is that rows with a null Weather value correspond to clear conditions

The dataset will be grouped into a daily breakdown with ridership type and weather condition

In [134]:
trips_data_w = trips_data.copy()  # copy so the fillna below does not mutate trips_data
trips_data_w['Weather'].fillna('clearday', inplace=True)

trips_data_w=trips_data_w.groupby([pd.Grouper(key='Start Time', freq='D')]).agg(
    rides=pd.NamedAgg(column='User Type', aggfunc=lambda x: (x.count())),
    annual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Annual Member'].count())),
    casual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Casual Member'].count())),
    weather_count= pd.NamedAgg(column='Weather', aggfunc=lambda x: (x.count())),
    weather_clear_count = pd.NamedAgg(column='Weather', aggfunc=lambda x: (x[x=='clearday'].count()))
)
trips_data_w['rides'] = trips_data_w['rides'].astype('Int64')
trips_data_w['annual_members'] = trips_data_w['annual_members'].astype('Int64')
trips_data_w['casual_members'] = trips_data_w['casual_members'].astype('Int64')
trips_data_w['weather_per'] = trips_data_w['weather_clear_count'].div(trips_data_w['weather_count'])
trips_data_w['weather_per']=trips_data_w['weather_per']*100
trips_data_w['weather']= np.where(trips_data_w['weather_per']>= 50.0 , 'Clear', 'Precipitation')
trips_data_w = trips_data_w[['rides','annual_members','casual_members','weather']]

# View DataFrame
trips_data_w.head(10)
Out[134]:
rides annual_members casual_members weather
Start Time
2018-01-01 00:00:00-05:00 243 235 8 Clear
2018-01-02 00:00:00-05:00 953 934 19 Clear
2018-01-03 00:00:00-05:00 1181 1166 15 Precipitation
2018-01-04 00:00:00-05:00 1169 1153 16 Clear
2018-01-05 00:00:00-05:00 783 773 10 Clear
2018-01-06 00:00:00-05:00 398 390 8 Clear
2018-01-07 00:00:00-05:00 524 514 10 Clear
2018-01-08 00:00:00-05:00 1013 998 15 Clear
2018-01-09 00:00:00-05:00 1890 1841 49 Clear
2018-01-10 00:00:00-05:00 1877 1835 42 Clear

Dropping any days where rides is 0 or NA

In [135]:
trips_data_w_d =trips_data_w[(trips_data_w[['rides']] != 0).all(axis=1)]
trips_data_w_d=trips_data_w_d.dropna()
trips_data_w_d.head(10)
Out[135]:
rides annual_members casual_members weather
Start Time
2018-01-01 00:00:00-05:00 243 235 8 Clear
2018-01-02 00:00:00-05:00 953 934 19 Clear
2018-01-03 00:00:00-05:00 1181 1166 15 Precipitation
2018-01-04 00:00:00-05:00 1169 1153 16 Clear
2018-01-05 00:00:00-05:00 783 773 10 Clear
2018-01-06 00:00:00-05:00 398 390 8 Clear
2018-01-07 00:00:00-05:00 524 514 10 Clear
2018-01-08 00:00:00-05:00 1013 998 15 Clear
2018-01-09 00:00:00-05:00 1890 1841 49 Clear
2018-01-10 00:00:00-05:00 1877 1835 42 Clear

Plotting Visual Representation

First a box plot is used to compare behaviour changes due to weather conditions for all riders

In [136]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Total Ridership', fontsize=18)
trips_data_w_d['rides']=trips_data_w_d['rides'].astype('float64')
ex = sns.boxplot(x="weather", y="rides", data=trips_data_w_d)
ex.set_ylim(0,16000)
ex.set_xlabel('Weather Conditions',fontsize=16)
ex.set_ylabel('Rides per Day',fontsize=16)
plt.show()

Adding a violin plot

In [137]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Total Ridership', fontsize=18)
trips_data_w_d['rides']=trips_data_w_d['rides'].astype('float64')
fx = sns.violinplot(x="weather", y="rides", data=trips_data_w_d)
fx.set_ylim(0,16000)
fx.set_xlabel('Weather Conditions',fontsize=16)
fx.set_ylabel('Rides per Day',fontsize=16)
plt.show()

From a total ridership perspective, when the weather is clear there is a more even distribution of rides per day, ranging from 2,000 to 10,000. When there is precipitation, the probability of high daily ridership drops significantly, and low ridership days (around 2,000 rides) become by far the most likely.
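The visual reading of the box and violin plots can be backed up with a summary number. A sketch using hypothetical daily totals in the shape of trips_data_w_d:

```python
import pandas as pd

# Hypothetical daily totals in the shape of trips_data_w_d.
df = pd.DataFrame({
    'rides':   [8000.0, 9500.0, 2100.0, 7000.0, 1800.0, 6500.0],
    'weather': ['Clear', 'Clear', 'Precipitation', 'Clear',
                'Precipitation', 'Clear'],
})

# Median rides per day under each weather condition.
medians = df.groupby('weather')['rides'].median()
print(medians)
```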

To further explore this, annual vs casual members are examined individually

Annual members are first looked at using both box plot and violin plot

In [138]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Annual Ridership', fontsize=18)
trips_data_w_d['annual_members']=trips_data_w_d['annual_members'].astype('float64')
gx = sns.boxplot(x="weather", y="annual_members", data=trips_data_w_d)
gx.set_ylim(0,16000)
gx.set_xlabel('Weather Conditions',fontsize=16)
gx.set_ylabel('Annual Member Rides per Day',fontsize=16)
plt.show()
In [139]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Annual Ridership', fontsize=18)
hx = sns.violinplot(x="weather", y="annual_members", data=trips_data_w_d)
hx.set_ylim(0,16000)
hx.set_xlabel('Weather Conditions',fontsize=16)
hx.set_ylabel('Annual Member Rides per Day',fontsize=16)
plt.show()

Now repeat this exercise for casual members

In [140]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Casual Ridership', fontsize=18)
trips_data_w_d['casual_members']=trips_data_w_d['casual_members'].astype('float64')
ix = sns.boxplot(x="weather", y="casual_members", data=trips_data_w_d)
ix.set_xlabel('Weather Conditions',fontsize=16)
ix.set_ylabel('Casual Member Rides per Day',fontsize=16)
plt.show()
In [141]:
plt.figure(figsize=(8, 5))
plt.title('Effect of Weather on Casual Ridership', fontsize=18)
trips_data_w_d['casual_members']=trips_data_w_d['casual_members'].astype('float64')
jx = sns.violinplot(x="weather", y="casual_members", data=trips_data_w_d)
jx.set_xlabel('Weather Conditions',fontsize=16)
jx.set_ylabel('Casual Member Rides per Day',fontsize=16)
plt.show()

Based on the boxplot and the violin plot above, it is evident that weather does have a significant impact on the probability of ridership per day for both casual and annual members.

Comparing the two rider types, casual members are more affected by weather conditions than annual members, especially in the range of daily ridership: on clear days the distribution extends to roughly 13,000 rides, while on precipitation days it shrinks to about 9,000 per day. The range for annual members varies much less, suggesting that a larger share of annual members ride regardless of precipitation.
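The difference in weather sensitivity between the two groups can be expressed as a relative drop in median daily rides. A sketch with hypothetical medians (the real values would come from trips_data_w_d):

```python
import pandas as pd

# Hypothetical medians per user type and weather, echoing the plots above.
medians = pd.DataFrame(
    {'Clear': [5000.0, 1200.0], 'Precipitation': [4200.0, 500.0]},
    index=['annual', 'casual'],
)

# Relative drop in median daily rides on precipitation days; a larger
# percentage means that group is more weather-sensitive.
drop_pct = (1 - medians['Precipitation'] / medians['Clear']) * 100
print(drop_pct.round(1))
```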

Other Weather Factors influencing Ridership

In this section, in addition to precipitation, other weather factors are explored to determine its impact on ridership

Data Grouping

To explore additional factors, a new dataframe is used to incorporate max temperature, min temperature, average relative humidity, and mean wind speed on a daily basis.

In [142]:
trips_data_wa= trips_data

trips_data_wa=trips_data_wa.groupby([pd.Grouper(key='Start Time', freq='D')]).agg(
    rides=pd.NamedAgg(column='User Type', aggfunc=lambda x: (x.count())),
    annual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Annual Member'].count())),
    casual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Casual Member'].count())),
    max_temp = pd.NamedAgg(column='Temp (°C)', aggfunc=lambda x: (x.max())),
    min_temp = pd.NamedAgg(column='Temp (°C)', aggfunc=lambda x: (x.min())),    
    rel_hum= pd.NamedAgg(column='Rel Hum (%)', aggfunc=lambda x: (x.mean())),
    wind_speed= pd.NamedAgg(column='Wind Spd (km/h)', aggfunc=lambda x: (x.mean())),
)
trips_data_wa['rides'] = trips_data_wa['rides'].astype('Int64')
trips_data_wa['annual_members'] = trips_data_wa['annual_members'].astype('Int64')
trips_data_wa['casual_members'] = trips_data_wa['casual_members'].astype('Int64')
trips_data_wa['max temperature'] = trips_data_wa['max_temp']
trips_data_wa['min temperature'] = trips_data_wa['min_temp']
trips_data_wa['average relative humidity'] = trips_data_wa['rel_hum']
trips_data_wa['average wind speed'] = trips_data_wa['wind_speed']
trips_data_wa = trips_data_wa[['rides','annual_members','casual_members','max temperature','min temperature','average relative humidity','average wind speed']]

# View DataFrame
trips_data_wa.head(10)
Out[142]:
rides annual_members casual_members max temperature min temperature average relative humidity average wind speed
Start Time
2018-01-01 00:00:00-05:00 243 235 8 -8.0 -17.9 64.226337 24.152263
2018-01-02 00:00:00-05:00 953 934 19 -6.6 -13.3 73.600210 40.710388
2018-01-03 00:00:00-05:00 1181 1166 15 -5.0 -10.1 68.174428 34.339543
2018-01-04 00:00:00-05:00 1169 1153 16 -7.6 -18.7 62.354149 26.723695
2018-01-05 00:00:00-05:00 783 773 10 -15.2 -20.4 57.383142 23.342273
2018-01-06 00:00:00-05:00 398 390 8 -15.4 -21.9 56.402010 16.979899
2018-01-07 00:00:00-05:00 524 514 10 -1.2 -17.2 67.971374 41.295802
2018-01-08 00:00:00-05:00 1013 998 15 2.7 -1.2 83.671273 27.072063
2018-01-09 00:00:00-05:00 1890 1841 49 1.6 -1.0 66.428571 15.307407
2018-01-10 00:00:00-05:00 1877 1835 42 2.7 -1.4 85.392115 13.351092

As well as dropping any days with 0 ridership or NA values

In [143]:
trips_data_wa_d=trips_data_wa[(trips_data_wa[['rides']] != 0).all(axis=1)]
trips_data_wa_d = trips_data_wa_d.dropna()
trips_data_wa_d.head()
Out[143]:
rides annual_members casual_members max temperature min temperature average relative humidity average wind speed
Start Time
2018-01-01 00:00:00-05:00 243 235 8 -8.0 -17.9 64.226337 24.152263
2018-01-02 00:00:00-05:00 953 934 19 -6.6 -13.3 73.600210 40.710388
2018-01-03 00:00:00-05:00 1181 1166 15 -5.0 -10.1 68.174428 34.339543
2018-01-04 00:00:00-05:00 1169 1153 16 -7.6 -18.7 62.354149 26.723695
2018-01-05 00:00:00-05:00 783 773 10 -15.2 -20.4 57.383142 23.342273

Temperature influence

Now we can look at the influence of temperature on ridership through scatter plots, first for the maximum daily temperature

In [144]:
plt.figure(figsize=(8, 5))
plt.title('Maximum Temperature Influence on Ridership', fontsize=18)
kxma1= sns.scatterplot(data=trips_data_wa_d, x="max temperature", y="annual_members",label='Annual Members')
kxma2= sns.scatterplot(data=trips_data_wa_d, x="max temperature", y="casual_members",label='Casual Members')
plt.legend()
kxma1.set_xlabel('Maximum Daily Temperature',fontsize=16)
kxma2.set_ylabel('Rides per Day',fontsize=16)
plt.show()
In [145]:
plt.figure(figsize=(8, 5))
plt.title('Minimum Temperature Influence on Ridership', fontsize=18)
kxmi1= sns.scatterplot(data=trips_data_wa_d, x="min temperature", y="annual_members",label='Annual Members')
kxmi2= sns.scatterplot(data=trips_data_wa_d, x="min temperature", y="casual_members",label='Casual Members')
plt.legend()
kxmi1.set_xlabel('Minimum Daily Temperature',fontsize=16)
kxmi2.set_ylabel('Rides per Day',fontsize=16)
plt.show()

Relative Humidity influence

After temperature, relative humidity is examined using scatter plot

In [146]:
plt.figure(figsize=(8, 5))
plt.title('Relative Humidity Influence on Ridership', fontsize=18)
lx1= sns.scatterplot(data=trips_data_wa_d, x="average relative humidity", y="annual_members",label='Annual Members')
lx2= sns.scatterplot(data=trips_data_wa_d, x="average relative humidity", y="casual_members",label='Casual Members')
plt.legend()
lx1.set_xlabel('Average Daily Relative Humidity',fontsize=16)
lx1.set_ylabel('Rides per Day',fontsize=16)
plt.show()

Wind Speed influence

Lastly, the average wind speed per day is also examined:

In [147]:
plt.figure(figsize=(8, 5))
plt.title('Wind Speed Influence on Ridership', fontsize=18)
mx1= sns.scatterplot(data=trips_data_wa_d, x="average wind speed", y="annual_members",label='Annual Members')
mx2= sns.scatterplot(data=trips_data_wa_d, x="average wind speed", y="casual_members",label='Casual Members')
plt.legend()
mx1.set_xlabel('Average Daily Wind Speed',fontsize=16)
mx1.set_ylabel('Rides per Day',fontsize=16)
plt.show()

From the scatter plots above, both temperature (max and min) and average wind speed have a significant impact on daily ridership for both annual and casual members: higher temperatures bring more riders, while colder days see fewer. The same holds for wind speed, where riding outside is presumably more difficult at higher wind speeds than at lower ones.

On the contrary, relative humidity does not significantly affect ridership for either user group, matching the expectation that riders are concerned with precipitation rather than humidity.
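These scatter-plot impressions could be quantified with Pearson correlations. A sketch on synthetic data built to mimic the observed patterns (rides rising with temperature, humidity unrelated), not the real measurements:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
temp = rng.uniform(-15, 30, 200)                    # synthetic daily max temp
rides = 200 + 120 * temp + rng.normal(0, 300, 200)  # rides rise with temperature
hum = rng.uniform(40, 95, 200)                      # humidity unrelated to rides

df = pd.DataFrame({'max temperature': temp,
                   'average relative humidity': hum,
                   'rides': rides})

# Pearson correlations against rides quantify which factors move with ridership.
corr = df.corr()['rides'].drop('rides')
print(corr.round(2))
```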

Pandemic Influence Analysis

This section will examine whether the pandemic has influenced bike sharing behaviour

Import COVID Data from City of Toronto Database

In [148]:
pandemic_data = pd.read_csv('COVID19 cases Toronto.csv')
pandemic_data.head()
Out[148]:
_id Assigned_ID Outbreak Associated Age Group Neighbourhood Name FSA Source of Infection Classification Episode Date Reported Date Client Gender Outcome Currently Hospitalized Currently in ICU Currently Intubated Ever Hospitalized Ever in ICU Ever Intubated
0 461196 1 Sporadic 50 to 59 Years Willowdale East M2N Travel CONFIRMED 2020-01-22 2020-01-23 FEMALE RESOLVED No No No No No No
1 461197 2 Sporadic 50 to 59 Years Willowdale East M2N Travel CONFIRMED 2020-01-21 2020-01-23 MALE RESOLVED No No No Yes No No
2 461198 3 Sporadic 20 to 29 Years Parkwoods-Donalda M3A Travel CONFIRMED 2020-02-05 2020-02-21 FEMALE RESOLVED No No No No No No
3 461199 4 Sporadic 60 to 69 Years Church-Yonge Corridor M4W Travel CONFIRMED 2020-02-16 2020-02-25 FEMALE RESOLVED No No No No No No
4 461200 5 Sporadic 60 to 69 Years Church-Yonge Corridor M4W Travel CONFIRMED 2020-02-20 2020-02-26 MALE RESOLVED No No No No No No
In [149]:
pandemic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 18 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   _id                     100463 non-null  int64 
 1   Assigned_ID             100463 non-null  int64 
 2   Outbreak Associated     100463 non-null  object
 3   Age Group               100393 non-null  object
 4   Neighbourhood Name      98734 non-null   object
 5   FSA                     99314 non-null   object
 6   Source of Infection     100463 non-null  object
 7   Classification          100463 non-null  object
 8   Episode Date            100463 non-null  object
 9   Reported Date           100463 non-null  object
 10  Client Gender           100463 non-null  object
 11  Outcome                 100463 non-null  object
 12  Currently Hospitalized  100463 non-null  object
 13  Currently in ICU        100463 non-null  object
 14  Currently Intubated     100463 non-null  object
 15  Ever Hospitalized       100463 non-null  object
 16  Ever in ICU             100463 non-null  object
 17  Ever Intubated          100463 non-null  object
dtypes: int64(2), object(16)
memory usage: 13.8+ MB

We will use Episode Date as our date for comparison, but it needs to be converted to datetime and localized first

In [150]:
pandemic_data['Episode Date'] = pd.DatetimeIndex(pandemic_data['Episode Date']).tz_localize(tz = 'UTC').tz_convert(tz = 'EST')
In [151]:
pandemic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 18 columns):
 #   Column                  Non-Null Count   Dtype              
---  ------                  --------------   -----              
 0   _id                     100463 non-null  int64              
 1   Assigned_ID             100463 non-null  int64              
 2   Outbreak Associated     100463 non-null  object             
 3   Age Group               100393 non-null  object             
 4   Neighbourhood Name      98734 non-null   object             
 5   FSA                     99314 non-null   object             
 6   Source of Infection     100463 non-null  object             
 7   Classification          100463 non-null  object             
 8   Episode Date            100463 non-null  datetime64[ns, EST]
 9   Reported Date           100463 non-null  object             
 10  Client Gender           100463 non-null  object             
 11  Outcome                 100463 non-null  object             
 12  Currently Hospitalized  100463 non-null  object             
 13  Currently in ICU        100463 non-null  object             
 14  Currently Intubated     100463 non-null  object             
 15  Ever Hospitalized       100463 non-null  object             
 16  Ever in ICU             100463 non-null  object             
 17  Ever Intubated          100463 non-null  object             
dtypes: datetime64[ns, EST](1), int64(2), object(15)
memory usage: 13.8+ MB

Checking for NA values

In [152]:
pandemic_data.isnull().sum(axis=0).to_frame('count')
Out[152]:
count
_id 0
Assigned_ID 0
Outbreak Associated 0
Age Group 70
Neighbourhood Name 1729
FSA 1149
Source of Infection 0
Classification 0
Episode Date 0
Reported Date 0
Client Gender 0
Outcome 0
Currently Hospitalized 0
Currently in ICU 0
Currently Intubated 0
Ever Hospitalized 0
Ever in ICU 0
Ever Intubated 0

Now we group the data; before doing so, let's use the bike share data to examine which frequency works best

In [153]:
# Using the above trips_data_day dataframe to evaluate trips data on a frequency of days
plt.figure(figsize=(10, 5))
plt.title('Daily ridership', fontsize=18)
rdx=sns.lineplot(data=trips_data_day, x='Start Time', y="rides",label='total rides')
rdx.set_xlabel('Day',fontsize=12)
rdx.set_ylabel('Rides per day',fontsize=12)
plt.show()

The daily series above is too crowded to read, so a monthly frequency is adopted instead

In [154]:
trips_data_month = trips_data
trips_data_month=trips_data_month.groupby([pd.Grouper(key='Start Time', freq='M')]).agg(
    rides=pd.NamedAgg(column='User Type', aggfunc=lambda x: (x.count())),
    annual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Annual Member'].count())),
    casual_members = pd.NamedAgg(column='User Type', aggfunc=lambda x: (x[x=='Casual Member'].count())))

trips_data_month['rides'] = trips_data_month['rides'].astype('Int64')
trips_data_month['annual_members'] = trips_data_month['annual_members'].astype('Int64')
trips_data_month['casual_members'] = trips_data_month['casual_members'].astype('Int64')
trips_data_month.head(13)
Out[154]:
rides annual_members casual_members
Start Time
2018-01-31 00:00:00-05:00 43556 42373 1183
2018-02-28 00:00:00-05:00 49099 47124 1975
2018-03-31 00:00:00-05:00 83494 78304 5190
2018-04-30 00:00:00-05:00 91183 81684 9499
2018-05-31 00:00:00-05:00 197652 159360 38292
2018-06-30 00:00:00-05:00 235862 184480 51382
2018-07-31 00:00:00-05:00 270033 212149 57884
2018-08-31 00:00:00-05:00 266038 206549 59489
2018-09-30 00:00:00-05:00 244404 204547 39857
2018-10-31 00:00:00-05:00 172611 159108 13503
2018-11-30 00:00:00-05:00 103489 100280 3209
2018-12-31 00:00:00-05:00 82657 80289 2368
2019-01-31 00:00:00-05:00 95080 77190 17890
In [155]:
plt.figure(figsize=(10, 5))
plt.title('Monthly ridership', fontsize=18)
rmx=sns.lineplot(data=trips_data_month, x='Start Time', y="rides",label='rides')
rmx.set_xlabel('Month',fontsize=12)
rmx.set_ylabel('Rides per month',fontsize=12)
plt.show()

This is much clearer, so the pandemic data will also be aggregated at a monthly frequency

In [156]:
pandemic_data_month = pandemic_data
pandemic_data_month=pandemic_data_month.groupby([pd.Grouper(key='Episode Date', freq='M')]).agg(
    cases=pd.NamedAgg(column='Assigned_ID', aggfunc=lambda x: (x.count())))

pandemic_data_month['cases'] = pandemic_data_month['cases'].astype('Int64')
pandemic_data_month.head(10)
Out[156]:
cases
Episode Date
2020-01-31 00:00:00-05:00 9
2020-02-29 00:00:00-05:00 22
2020-03-31 00:00:00-05:00 1960
2020-04-30 00:00:00-05:00 5604
2020-05-31 00:00:00-05:00 4742
2020-06-30 00:00:00-05:00 2051
2020-07-31 00:00:00-05:00 818
2020-08-31 00:00:00-05:00 914
2020-09-30 00:00:00-05:00 5352
2020-10-31 00:00:00-05:00 8843

Since there are NA values in the column above, let's check the dataframe to see whether there are indeed dates on which no patients were recorded

In [157]:
pandemic_data.sort_values(by='Episode Date', ascending=True,inplace = True)
pandemic_data.head(10)
Out[157]:
_id Assigned_ID Outbreak Associated Age Group Neighbourhood Name FSA Source of Infection Classification Episode Date Reported Date Client Gender Outcome Currently Hospitalized Currently in ICU Currently Intubated Ever Hospitalized Ever in ICU Ever Intubated
85153 546349 87895 Sporadic 20 to 29 Years Keelesdale-Eglinton West M6N No Information CONFIRMED 2020-01-20 19:00:00-05:00 2021-01-28 MALE RESOLVED No No No No No No
1 461197 2 Sporadic 50 to 59 Years Willowdale East M2N Travel CONFIRMED 2020-01-20 19:00:00-05:00 2020-01-23 MALE RESOLVED No No No Yes No No
25432 486628 26481 Sporadic 20 to 29 Years Bay Street Corridor M5G Community CONFIRMED 2020-01-21 19:00:00-05:00 2020-10-24 FEMALE RESOLVED No No No No No No
84578 545774 87307 Sporadic 40 to 49 Years West Humber-Clairville M9V Household Contact CONFIRMED 2020-01-21 19:00:00-05:00 2021-01-27 FEMALE RESOLVED No No No No No No
0 461196 1 Sporadic 50 to 59 Years Willowdale East M2N Travel CONFIRMED 2020-01-21 19:00:00-05:00 2020-01-23 FEMALE RESOLVED No No No No No No
26955 488151 28045 Sporadic 40 to 49 Years Rosedale-Moore Park M4T No Information CONFIRMED 2020-01-23 19:00:00-05:00 2020-10-26 MALE RESOLVED No No No No No No
86870 548066 89638 Sporadic 70 to 79 Years Rockcliffe-Smythe M6N Community CONFIRMED 2020-01-24 19:00:00-05:00 2021-01-30 FEMALE RESOLVED No No No No No No
91058 552254 94008 Sporadic 19 and younger Wexford/Maryvale M1R Community CONFIRMED 2020-01-24 19:00:00-05:00 2021-02-09 FEMALE RESOLVED No No No No No No
88856 550052 91681 Sporadic 20 to 29 Years Corso Italia-Davenport M6E Community CONFIRMED 2020-01-27 19:00:00-05:00 2021-02-02 MALE RESOLVED No No No No No No
2 461198 3 Sporadic 20 to 29 Years Parkwoods-Donalda M3A Travel CONFIRMED 2020-02-04 19:00:00-05:00 2020-02-21 FEMALE RESOLVED No No No No No No

The above confirms that there are gaps between episode dates, i.e. days on which no patients were recorded, so we will replace NA with 0
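The mechanics can be sketched on toy data: a fixed-frequency aggregation (here `resample`, which behaves like the `Grouper` used above) emits a row for every month in the range, so empty months surface as NA rather than being silently dropped.

```python
import pandas as pd

# Hedged sketch on toy data, not the real case counts: a monthly resample
# produces a row for every month between the first and last date, so a month
# with no episodes shows up as <NA> and can safely be treated as zero cases.
dates = pd.to_datetime(['2020-01-15', '2020-03-02'])
cases = pd.Series(1, index=dates).resample('M').sum(min_count=1)
print(cases.fillna(0).astype(int).tolist())  # one entry each for Jan, Feb, Mar
```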

In [158]:
pandemic_data_month['cases'].fillna(0, inplace=True)
pandemic_data_month.head(100)
Out[158]:
cases
Episode Date
2020-01-31 00:00:00-05:00 9
2020-02-29 00:00:00-05:00 22
2020-03-31 00:00:00-05:00 1960
2020-04-30 00:00:00-05:00 5604
2020-05-31 00:00:00-05:00 4742
2020-06-30 00:00:00-05:00 2051
2020-07-31 00:00:00-05:00 818
2020-08-31 00:00:00-05:00 914
2020-09-30 00:00:00-05:00 5352
2020-10-31 00:00:00-05:00 8843
2020-11-30 00:00:00-05:00 15309
2020-12-31 00:00:00-05:00 22186
2021-01-31 00:00:00-05:00 21853
2021-02-28 00:00:00-05:00 9318
2021-03-31 00:00:00-05:00 1482

Let's take a brief look at what the pandemic looks like

In [159]:
plt.figure(figsize=(7, 5))
plt.title('Toronto Monthly Pandemic cases', fontsize=18)
nx= sns.lineplot(data=pandemic_data_month, x='Episode Date', y="cases")
nx.set_xlabel('Date',fontsize=12)
nx.set_ylabel('Cases per month',fontsize=12)
plt.show()

Data Merging between Bike Share Data and Toronto Pandemic Data

In this section, we will take the previously developed dataframe of monthly ridership from 2018 onward and join it with the Toronto pandemic data

In [160]:
# using the monthly trips_data_month dataframe developed above
# reset the index on both frames and expose it as a 'merge_time' column
trips_data_month_m=trips_data_month
trips_data_month['merge_time'] = trips_data_month.index
trips_data_month=trips_data_month.reset_index(drop=True)
pandemic_data_month_m = pandemic_data_month
pandemic_data_month_m['merge_time']= pandemic_data_month_m.index
pandemic_data_month_m=pandemic_data_month_m.reset_index(drop=True)

# merging two dataframes 
bike_pandemic_data = pd.merge(trips_data_month_m, pandemic_data_month, how ='left', on=['merge_time'])
bike_pandemic_data.fillna(0, inplace=True)

bike_pandemic_data.head()
Out[160]:
rides annual_members casual_members merge_time cases
0 43556 42373 1183 2018-01-31 00:00:00-05:00 0
1 49099 47124 1975 2018-02-28 00:00:00-05:00 0
2 83494 78304 5190 2018-03-31 00:00:00-05:00 0
3 91183 81684 9499 2018-04-30 00:00:00-05:00 0
4 197652 159360 38292 2018-05-31 00:00:00-05:00 0
In [161]:
plt.figure(figsize=(10, 5))
plt.title('Comparing Bike Share Data with Pandemic Cases per Month', fontsize=18)
bpm=sns.lineplot(data=bike_pandemic_data, x='merge_time', y='rides', label = 'Rides')
bpm=sns.lineplot(data=bike_pandemic_data, x='merge_time', y='cases',label = 'Cases')
bpm.set_xlabel('Date',fontsize=12)
bpm.set_ylabel('Rides/cases per month',fontsize=12)
plt.show()

The above graph does not reveal much about the pandemic's impact on ridership, because the two series sit on very different scales; we will now scale the data for a clearer comparison

Data Scaling
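Before applying scikit-learn's `MinMaxScaler` below, the transform it performs on each column can be sketched by hand (toy values, not the full ridership series):

```python
import pandas as pd

# Hedged sketch of min-max scaling: (x - min) / (max - min) maps each column
# onto [0, 1] so rides and cases can share one axis. Toy column only.
s = pd.Series([43556.0, 49099.0, 373599.0])
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.round(3).tolist())  # minimum maps to 0.0, maximum to 1.0
```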

In [162]:
bike_pandemic_data_scale=bike_pandemic_data
bike_pandemic_data_scale.index= bike_pandemic_data_scale['merge_time']
bike_pandemic_data_scale=bike_pandemic_data_scale.drop(columns=['merge_time'])
bike_pandemic_data_scale.head()
Out[162]:
rides annual_members casual_members cases
merge_time
2018-01-31 00:00:00-05:00 43556 42373 1183 0
2018-02-28 00:00:00-05:00 49099 47124 1975 0
2018-03-31 00:00:00-05:00 83494 78304 5190 0
2018-04-30 00:00:00-05:00 91183 81684 9499 0
2018-05-31 00:00:00-05:00 197652 159360 38292 0
In [163]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
bike_pandemic_data_scale_mm = bike_pandemic_data_scale
bike_pandemic_data_scale_mm[['rides', 'annual_members','casual_members','cases']] = scaler.fit_transform(bike_pandemic_data_scale_mm[['rides', 'annual_members','casual_members','cases']])
bike_pandemic_data_scale_mm.head()
Out[163]:
rides annual_members casual_members cases
merge_time
2018-01-31 00:00:00-05:00 0.000000 0.000000 0.000000 0.0
2018-02-28 00:00:00-05:00 0.016774 0.024370 0.005824 0.0
2018-03-31 00:00:00-05:00 0.120857 0.184308 0.029463 0.0
2018-04-30 00:00:00-05:00 0.144125 0.201646 0.061147 0.0
2018-05-31 00:00:00-05:00 0.466312 0.600084 0.272860 0.0
In [164]:
plt.figure(figsize=(10, 5))

plt.title('Comparing Bike Share Data with Pandemic Cases per Month', fontsize=18)
bpmmm=sns.lineplot(data=bike_pandemic_data_scale_mm, x='merge_time', y='rides', label = 'Rides')
bpmmm =sns.lineplot(data=bike_pandemic_data_scale_mm, x='merge_time', y='cases',label = 'Cases')
bpmmm.set_xlabel('Date',fontsize=12)
bpmmm.set_ylabel('Rides/cases per month',fontsize=12)
plt.show()

From the above graph, ridership decreased at the start of the pandemic (February to May 2020). After May 2020, ridership rose as case counts fell, until September 2020, when the second wave hit Toronto: cases climbed and ridership dropped again.

This initial observation could imply that the pandemic influenced overall ridership. However, the earlier years tell a similar seasonal story: in 2018 and 2019, without a pandemic, ridership also rose between May and September, likely due to summer weather, and declined after September as winter set in.

Considering both factors, the pandemic does appear to have influenced ridership, with ridership climbing faster when case rates fell and vice versa, but the extent of this effect is hard to isolate because weather also strongly shapes user behaviour.
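One way to make this argument more concrete is to deseasonalize: express each 2020 month as a ratio of the 2018-2019 average for the same calendar month. The sketch below uses illustrative ride counts (thousands), not the actual totals from `trips_data_month`:

```python
import pandas as pd

# Hedged sketch with illustrative numbers: a ratio below 1.0 means 2020
# ridership fell short of the pre-pandemic norm for that calendar month,
# which separates the pandemic signal from ordinary summer seasonality.
monthly = pd.DataFrame({
    'year':  [2018, 2019, 2020] * 3,
    'month': [4, 4, 4, 5, 5, 5, 6, 6, 6],
    'rides': [91, 105, 40, 198, 210, 120, 236, 240, 200],  # thousands (toy)
})
baseline = monthly[monthly['year'] < 2020].groupby('month')['rides'].mean()
rides_2020 = monthly[monthly['year'] == 2020].set_index('month')['rides']
ratio = (rides_2020 / baseline).round(2)
print(ratio.tolist())
```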

Stage 2: Temporal Analysis

Casual User Vs. Annual Users

Import raw Cycling Data

In [165]:
trips_raw = pd.read_csv('trips_raw_data.csv')

Explore data types and rows in trips_raw_data.csv

In [166]:
trips_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8039088 entries, 0 to 8039087
Data columns (total 24 columns):
 #   Column               Dtype  
---  ------               -----  
 0   Unnamed: 0           int64  
 1   Trip Id              int64  
 2   Subscription Id      float64
 3   Trip Duration        int64  
 4   Start Station Id     float64
 5   Start Time           object 
 6   Start Station Name   object 
 7   End Station Id       float64
 8   End Time             object 
 9   End Station Name     object 
 10  Bike Id              float64
 11  User Type            object 
 12  merge_time           object 
 13  Date/Time            object 
 14  Temp (°C)            float64
 15  Dew Point Temp (°C)  float64
 16  Rel Hum (%)          float64
 17  Wind Dir (10s deg)   float64
 18  Wind Spd (km/h)      float64
 19  Visibility (km)      float64
 20  Stn Press (kPa)      float64
 21  Hmdx                 float64
 22  Wind Chill           float64
 23  Weather              object 
dtypes: float64(13), int64(3), object(8)
memory usage: 1.4+ GB
In [167]:
trips_raw.head()
Out[167]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
0 58 712441 NaN 274 7006.0 2017-01-01 00:03:00-05:00 Bay St / College St (East Side) 7021.0 2017-01-01 00:08:00-05:00 Bay St / Albert St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
1 59 712442 NaN 538 7046.0 2017-01-01 00:03:00-05:00 Niagara St / Richmond St W 7147.0 2017-01-01 00:12:00-05:00 King St W / Fraser Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
2 60 712443 NaN 992 7048.0 2017-01-01 00:05:00-05:00 Front St W / Yonge St (Hockey Hall of Fame) 7089.0 2017-01-01 00:22:00-05:00 Church St / Wood St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
3 61 712444 NaN 1005 7177.0 2017-01-01 00:09:00-05:00 East Liberty St / Pirandello St 7202.0 2017-01-01 00:26:00-05:00 Queen St W / York St (City Hall) ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
4 62 712445 NaN 645 7203.0 2017-01-01 00:14:00-05:00 Bathurst St/Queens Quay(Billy Bishop Airport) 7010.0 2017-01-01 00:25:00-05:00 King St W / Spadina Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN

5 rows × 24 columns

Cleaning Data

Removing all columns except Trip Id, Trip Duration, Start Time and User Type

In [168]:
trips_raw = trips_raw[['Trip Id', 'Trip Duration', 'Start Time', 'User Type']]

Converting Start Time Column to a datetime format

In [169]:
trips_raw['Start Time'] = pd.DatetimeIndex(trips_raw['Start Time'])

Analyzing how many different user types there are

In [170]:
unique_subscriptions = trips_raw['User Type'].unique()
unique_subscriptions
Out[170]:
array([nan, 'Annual Member', 'Casual Member'], dtype=object)

Removing all NaN values from the User Type column

In [171]:
trips_raw = trips_raw[~trips_raw['User Type'].isna()]

Determining the time range of the data

In [172]:
min_date = trips_raw['Start Time'].min()
max_date = trips_raw['Start Time'].max()
print(min_date)
print(max_date)
2018-01-01 00:47:00-05:00
2020-12-10 23:59:00-05:00

Looks like before 2018-01-01, user types were not tracked. Also, the last 21 days of December 2020 are missing from the data set, so December will be removed from each year to ensure we compare equal spans of time.

In [224]:
trips_raw = trips_raw[trips_raw['Start Time'].dt.month!=12]

Split data set into casual and annual users

In [225]:
casual_user = trips_raw[trips_raw['User Type'] == 'Casual Member']
annual_user = trips_raw[trips_raw['User Type'] == 'Annual Member']

Grouping data by year

In [226]:
casual_user['Year'] = casual_user['Start Time'].dt.year
annual_user['Year'] = annual_user['Start Time'].dt.year
In [227]:
number_casual_users = casual_user.groupby(['Year']).size()
number_annual_users = annual_user.groupby(['Year']).size()
In [228]:
number_casual_users
Out[228]:
Year
2018    281463
2019    487620
2020    805593
dtype: int64
In [229]:
number_annual_users
Out[229]:
Year
2018    1475958
2019    1733834
2020    1577893
dtype: int64

Plotting Data

In [231]:
fig, ax = plt.subplots()

p1 = ax.barh(np.arange(3)+0.4, number_casual_users, 0.4, align = 'center')
p2 = ax.barh(np.arange(3), number_annual_users, 0.4, align = 'center')

ax.set_yticks(np.arange(3)+0.4/2)

ax.set_yticklabels(number_casual_users.index)  # set ticks before labels so the labels are not overwritten

ax.xaxis.grid(True)
ax.yaxis.grid(False)

ax.set_xlabel('Number of Trips (January to November)')
ax.set_ylabel('Year')
ax.set_title('Number of Casual User Trips vs. Annual User Trips from 2018 to 2020')

fig.set_size_inches(8,5)

ax.legend((p1[0], p2[0]), ('Casual Users', 'Annual Users'), 
          loc='best')
Out[231]:
<matplotlib.legend.Legend at 0x7fc4ce09e190>

The number of casual-user trips has increased steadily, while the number of annual-user trips has fluctuated.
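The year-over-year growth rates behind this observation can be computed directly from the trip counts printed in the two cells above:

```python
import pandas as pd

# Trip counts taken from the groupby outputs above (January-November only).
casual = pd.Series([281463, 487620, 805593], index=[2018, 2019, 2020])
annual = pd.Series([1475958, 1733834, 1577893], index=[2018, 2019, 2020])
print(casual.pct_change().round(2).tolist())  # positive both years
print(annual.pct_change().round(2).tolist())  # up in 2019, down in 2020
```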

How Popular Are Free Ride Wednesdays

Import raw cycling Data

In [180]:
trips = pd.read_csv('trips_raw_data.csv')
trips_raw = trips.copy()

Explore data types and rows in trips_raw_data.csv

In [181]:
trips_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8039088 entries, 0 to 8039087
Data columns (total 24 columns):
 #   Column               Dtype  
---  ------               -----  
 0   Unnamed: 0           int64  
 1   Trip Id              int64  
 2   Subscription Id      float64
 3   Trip Duration        int64  
 4   Start Station Id     float64
 5   Start Time           object 
 6   Start Station Name   object 
 7   End Station Id       float64
 8   End Time             object 
 9   End Station Name     object 
 10  Bike Id              float64
 11  User Type            object 
 12  merge_time           object 
 13  Date/Time            object 
 14  Temp (°C)            float64
 15  Dew Point Temp (°C)  float64
 16  Rel Hum (%)          float64
 17  Wind Dir (10s deg)   float64
 18  Wind Spd (km/h)      float64
 19  Visibility (km)      float64
 20  Stn Press (kPa)      float64
 21  Hmdx                 float64
 22  Wind Chill           float64
 23  Weather              object 
dtypes: float64(13), int64(3), object(8)
memory usage: 1.4+ GB
In [182]:
trips_raw.head()
Out[182]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
0 58 712441 NaN 274 7006.0 2017-01-01 00:03:00-05:00 Bay St / College St (East Side) 7021.0 2017-01-01 00:08:00-05:00 Bay St / Albert St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
1 59 712442 NaN 538 7046.0 2017-01-01 00:03:00-05:00 Niagara St / Richmond St W 7147.0 2017-01-01 00:12:00-05:00 King St W / Fraser Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
2 60 712443 NaN 992 7048.0 2017-01-01 00:05:00-05:00 Front St W / Yonge St (Hockey Hall of Fame) 7089.0 2017-01-01 00:22:00-05:00 Church St / Wood St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
3 61 712444 NaN 1005 7177.0 2017-01-01 00:09:00-05:00 East Liberty St / Pirandello St 7202.0 2017-01-01 00:26:00-05:00 Queen St W / York St (City Hall) ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
4 62 712445 NaN 645 7203.0 2017-01-01 00:14:00-05:00 Bathurst St/Queens Quay(Billy Bishop Airport) 7010.0 2017-01-01 00:25:00-05:00 King St W / Spadina Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN

5 rows × 24 columns

Cleaning Data

Removing all columns except Trip Id, Trip Duration, Start Time and User Type

In [183]:
trips_raw = trips_raw[['Trip Id', 'Trip Duration', 'Start Time', 'User Type']]

Converting Start Time Column to a datetime format

In [184]:
trips_raw['Start Time'] = pd.DatetimeIndex(trips_raw['Start Time'])

Analyzing how many different user types there are

In [185]:
unique_subscriptions = trips_raw['User Type'].unique()
unique_subscriptions
Out[185]:
array([nan, 'Annual Member', 'Casual Member'], dtype=object)

Removing all NaN values from the User Type column

In [186]:
trips_raw = trips_raw[~trips_raw['User Type'].isna()]

Determining the time range of the data

In [187]:
min_date = trips_raw['Start Time'].min()
max_date = trips_raw['Start Time'].max()
print(min_date)
print(max_date)
2018-01-01 00:47:00-05:00
2020-12-10 23:59:00-05:00

Looks like before 2018-01-01, user types were not tracked, and the last 21 days of December 2020 are missing from the data set. Free Ride Wednesdays run only in September, so we keep only September data from each year.

In [188]:
trips_raw = trips_raw[trips_raw['Start Time'].dt.month==9]

Split data set into casual and annual users

In [189]:
casual_user = trips_raw[trips_raw['User Type'] == 'Casual Member']
annual_user = trips_raw[trips_raw['User Type'] == 'Annual Member']

Grouping data by weekday

In [190]:
casual_user['Day'] = casual_user['Start Time'].dt.weekday
annual_user['Day'] = annual_user['Start Time'].dt.weekday
In [191]:
number_casual_users = casual_user.groupby(['Day']).size()
number_annual_users = annual_user.groupby(['Day']).size()
In [192]:
number_casual_users
Out[192]:
Day
0    22686
1    22984
2    30673
3    16672
4    20553
5    39457
6    43379
dtype: int64
In [193]:
number_annual_users
Out[193]:
Day
0     95490
1    103538
2     99884
3     94461
4     87750
5     76382
6     84575
dtype: int64

Plotting Data

In [194]:
fig, ax = plt.subplots()

p1 = ax.barh(np.arange(7)+0.4, number_casual_users/1000, 0.4, align = 'center')
p2 = ax.barh(np.arange(7), number_annual_users/1000, 0.4, align = 'center')

Weekdays = ['Mon', 'Tue', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']

ax.set_yticks(np.arange(7)+0.4/2)  # centre ticks between each pair of bars
ax.set_yticklabels(Weekdays)

ax.xaxis.grid(True)
ax.yaxis.grid(False)

ax.set_xlabel('Number of Trips (1000s)')
ax.set_ylabel('Weekday')
ax.set_title('Free Ride Wednesday Popularity, 2018 - 2020 September Only, Casual Vs. Annual Users')

ax.legend((p1[0], p2[0]), ('Casual Users', 'Annual Users'), 
          loc='best')
Out[194]:
<matplotlib.legend.Legend at 0x7fc44c894d90>

There is a clear spike on Free Ride Wednesday. However, weekends are still more popular with casual users. Annual users are most likely using the bike share for commuting rather than leisure, unlike casual users.
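Using the September counts printed above, the size of the Wednesday spike for casual users can be quantified as a lift over the other weekdays:

```python
import pandas as pd

# Casual-user weekday counts from the groupby output above
# (index 0 = Monday ... 4 = Friday; weekends excluded from the baseline).
casual_by_day = pd.Series([22686, 22984, 30673, 16672, 20553], index=range(5))
baseline = casual_by_day.drop(2).mean()  # average of the non-Wednesday weekdays
lift = casual_by_day[2] / baseline
print(round(lift, 2))  # Wednesday rides relative to the other weekdays
```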

Ride Distribution Over Year, Week and Day

Cleaning Data

Removing all columns except Trip Id, Trip Duration, Start Time and User Type

In [195]:
trips_raw = trips.copy()  # re-copy the raw frame loaded earlier

trips_raw = trips_raw[['Trip Id', 'Trip Duration', 'Start Time', 'User Type']]

Converting Start Time Column to a datetime format

In [196]:
trips_raw['Start Time'] = pd.DatetimeIndex(trips_raw['Start Time'])

Analyzing how many different user types there are

In [197]:
unique_subscriptions = trips_raw['User Type'].unique()
unique_subscriptions
Out[197]:
array([nan, 'Annual Member', 'Casual Member'], dtype=object)

Removing all NaN values from the User Type column

In [198]:
trips_raw = trips_raw[~trips_raw['User Type'].isna()]

Determining the time range of the data

In [199]:
min_date = trips_raw['Start Time'].min()
max_date = trips_raw['Start Time'].max()
print(min_date)
print(max_date)
2018-01-01 00:47:00-05:00
2020-12-10 23:59:00-05:00

Looks like before 2018-01-01, user types were not tracked. Also, the last 21 days of December 2020 are missing from the data set, so December is removed from each year to keep the periods comparable.

In [200]:
trips_raw = trips_raw[trips_raw['Start Time'].dt.month!=12]

Split data set into casual and annual users

In [201]:
casual_user = trips_raw[trips_raw['User Type'] == 'Casual Member']
annual_user = trips_raw[trips_raw['User Type'] == 'Annual Member']

Adding columns to the dataframes indicating the year, month, week, weekday, day and hour each trip took place

In [202]:
casual_user['Year'] = casual_user['Start Time'].dt.year
annual_user['Year'] = annual_user['Start Time'].dt.year
casual_user['Month'] = casual_user['Start Time'].dt.month
annual_user['Month'] = annual_user['Start Time'].dt.month
casual_user['Week'] = casual_user['Start Time'].dt.isocalendar().week  # .dt.week is deprecated in recent pandas
annual_user['Week'] = annual_user['Start Time'].dt.isocalendar().week
casual_user['Weekday'] = casual_user['Start Time'].dt.weekday
annual_user['Weekday'] = annual_user['Start Time'].dt.weekday
casual_user['Day'] = casual_user['Start Time'].dt.day
annual_user['Day'] = annual_user['Start Time'].dt.day
casual_user['Hour'] = casual_user['Start Time'].dt.hour
annual_user['Hour'] = annual_user['Start Time'].dt.hour
In [203]:
casual_user.head()
Out[203]:
Trip Id Trip Duration Start Time User Type Year Month Week Weekday Day Hour
1392046 2383664 797 2018-01-01 02:25:00-05:00 Casual Member 2018 1 1 0 1 2
1392060 2383678 1216 2018-01-01 03:46:00-05:00 Casual Member 2018 1 1 0 1 3
1392127 2383803 339 2018-01-01 12:53:00-05:00 Casual Member 2018 1 1 0 1 12
1392161 2383864 1318 2018-01-01 14:57:00-05:00 Casual Member 2018 1 1 0 1 14
1392162 2383865 1293 2018-01-01 14:57:00-05:00 Casual Member 2018 1 1 0 1 14
In [204]:
annual_user.head()
Out[204]:
Trip Id Trip Duration Start Time User Type Year Month Week Weekday Day Hour
1392031 2383648 393 2018-01-01 00:47:00-05:00 Annual Member 2018 1 1 0 1 0
1392032 2383649 625 2018-01-01 00:52:00-05:00 Annual Member 2018 1 1 0 1 0
1392033 2383650 233 2018-01-01 00:55:00-05:00 Annual Member 2018 1 1 0 1 0
1392034 2383651 1138 2018-01-01 00:57:00-05:00 Annual Member 2018 1 1 0 1 0
1392035 2383652 703 2018-01-01 01:00:00-05:00 Annual Member 2018 1 1 0 1 1

Plotting Data (monthly distribution over the year)

Grouping data by month over 2018 to 2020. Dividing by 3 gives the average number of trips for that month of the year.

In [205]:
number_casual_users_month = (casual_user.groupby(['Month']).size())/3
number_annual_users_month = (annual_user.groupby(['Month']).size())/3

Plotting monthly distribution (average month over the 3 years)

In [206]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_month/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_month/1000,  linewidth = 3)

ax.yaxis.grid(True)
ax.xaxis.grid(False)

Months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec',]

plt.xticks(range(12), Months)

ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Month of the Year')
ax.set_title('Average Monthly Users from 2018 - 2020')

ax.legend((p1[0],p2[0]), ('Casual Users','Annual Users'), 
          loc='best')
Out[206]:
<matplotlib.legend.Legend at 0x7fc15beebc90>

Plotting Data (daily distribution over the week)

Grouping daily data over the 7 days of the week

In [207]:
number_casual_users_day = (casual_user.groupby(['Weekday']).size())/3
number_annual_users_day = (annual_user.groupby(['Weekday']).size())/3

Plotting daily distribution over the week (average day over the 3 years)

In [208]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_day/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_day/1000,  linewidth = 3)

ax.yaxis.grid(True)
ax.xaxis.grid(False)

Weekdays = ['Mon', 'Tue', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']

plt.xticks(range(7), Weekdays)

ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Day of the Week')
ax.set_title('Average Daily Users over the Week from 2018 - 2020')

ax.legend((p1[0],p2[0]), ('Casual Users','Annual Users'), 
          loc='best')
Out[208]:
<matplotlib.legend.Legend at 0x7fc15bebb050>

Plotting Data (hourly distribution over the day)

Grouping hourly data over the day

In [209]:
number_casual_users_Hour = (casual_user.groupby(['Hour']).size())/3
number_annual_users_Hour = (annual_user.groupby(['Hour']).size())/3

Plotting hourly distribution over the day (average day over the 3 years)

In [221]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_Hour/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_Hour/1000,  linewidth = 3)

ax.yaxis.grid(True)
ax.xaxis.grid(False)


ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Hour of the Day')
ax.set_title('Average Hourly Users over the Day from 2018 - 2020')

ax.legend((p1[0],p2[0]), ('Casual Users','Annual Users'), 
          loc='best')
fig.set_size_inches(8,5)
plt.show()

How Statutory Holidays Impact Demand

Since most of the cycling demand occurs in the summer months, only holidays in May, June, July and August will be analyzed. Additionally, we will only look at holidays in 2019.
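A small sketch (our own lookup, not part of the original pipeline) collects the 2019 holiday dates that the plots below mark with vertical lines, so the day-of-month positions are not scattered magic numbers:

```python
from datetime import date

# 2019 holiday dates used as axvline positions in the monthly plots below.
holidays_2019 = {
    "Mother's Day": date(2019, 5, 12),
    'Victoria Day': date(2019, 5, 20),
    "Father's Day": date(2019, 6, 16),
    'Canada Day': date(2019, 7, 1),
}
for name, d in holidays_2019.items():
    print(f'{name}: month {d.month}, day {d.day}')
```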

In [211]:
# Remove all years except 2019
casual_users_may_to_aug_2019 = casual_user[casual_user['Year']==2019]
annual_users_may_to_aug_2019 = annual_user[annual_user['Year']==2019]

# Remove all months except May
casual_users_may_2019 = casual_users_may_to_aug_2019[casual_users_may_to_aug_2019['Month'] == 5]
annual_users_may_2019 = annual_users_may_to_aug_2019[annual_users_may_to_aug_2019['Month'] == 5]

# Remove all months except June
casual_users_jun_2019 = casual_users_may_to_aug_2019[casual_users_may_to_aug_2019['Month'] == 6]
annual_users_jun_2019 = annual_users_may_to_aug_2019[annual_users_may_to_aug_2019['Month'] == 6]

# Remove all months except July
casual_users_jul_2019 = casual_users_may_to_aug_2019[casual_users_may_to_aug_2019['Month'] == 7]
annual_users_jul_2019 = annual_users_may_to_aug_2019[annual_users_may_to_aug_2019['Month'] == 7]

# Remove all months except Aug
casual_users_aug_2019 = casual_users_may_to_aug_2019[casual_users_may_to_aug_2019['Month'] == 8]
annual_users_aug_2019 = annual_users_may_to_aug_2019[annual_users_may_to_aug_2019['Month'] == 8]

Holiday Demand in May 2019

Grouping daily data over May

In [212]:
number_casual_users_day = casual_users_may_2019.groupby(['Day']).size()
number_annual_users_day = annual_users_may_2019.groupby(['Day']).size()

Plotting daily data over May

In [213]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_day/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_day/1000,  linewidth = 3)

mothers_day = plt.axvline(x = 12, color = 'red')
victoria_day = plt.axvline(x = 20, color = 'green')

ax.yaxis.grid(True)
ax.xaxis.grid(False)


ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Day of Month')
ax.set_title('Daily Trips in May 2019')

ax.legend((p1[0],p2[0], mothers_day, victoria_day), ('Casual Users','Annual Users', 'Mothers Day', 'Victoria Day'), 
          loc='best')
Out[213]:
<matplotlib.legend.Legend at 0x7fc15c4acf90>

Holiday Demand in June 2019

Grouping daily data over June

In [214]:
number_casual_users_day = casual_users_jun_2019.groupby(['Day']).size()
number_annual_users_day = annual_users_jun_2019.groupby(['Day']).size()

Plotting daily data over June

In [215]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_day/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_day/1000,  linewidth = 3)

fathers_day = plt.axvline(x = 16, color = 'red')


ax.yaxis.grid(True)
ax.xaxis.grid(False)


ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Day of Month')
ax.set_title('Daily Trips in June 2019')

ax.legend((p1[0],p2[0], fathers_day), ('Casual Users','Annual Users', 'Fathers Day'), 
          loc='best')
Out[215]:
<matplotlib.legend.Legend at 0x7fc1182d3ad0>

Holiday Demand in July 2019

Grouping daily data over July

In [216]:
number_casual_users_day = casual_users_jul_2019.groupby(['Day']).size()
number_annual_users_day = annual_users_jul_2019.groupby(['Day']).size()

Plotting daily data over July

In [217]:
fig, ax = plt.subplots()

p1 = ax.plot(number_casual_users_day/1000,  linewidth = 3)
p2 = ax.plot(number_annual_users_day/1000,  linewidth = 3)

canada_day = plt.axvline(x = 1, color = 'red')


ax.yaxis.grid(True)
ax.xaxis.grid(False)


ax.set_ylabel('Number of Trips (1000s)')
ax.set_xlabel('Day of Month')
ax.set_title('Daily Trips in July 2019')

ax.legend((p1[0],p2[0], canada_day), ('Casual Users','Annual Users', 'Canada Day'), 
          loc='best')
Out[217]:
<matplotlib.legend.Legend at 0x7fc11bb02fd0>

Spatial Analysis

Neighbourhoods Analysis

In [104]:
plt.rcParams['figure.dpi'] = 450
In [105]:
trips_raw = pd.read_csv('trips_raw_data.csv')
In [106]:
trips_raw.head()
Out[106]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
0 58 712441 NaN 274 7006.0 2017-01-01 00:03:00-05:00 Bay St / College St (East Side) 7021.0 2017-01-01 00:08:00-05:00 Bay St / Albert St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
1 59 712442 NaN 538 7046.0 2017-01-01 00:03:00-05:00 Niagara St / Richmond St W 7147.0 2017-01-01 00:12:00-05:00 King St W / Fraser Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
2 60 712443 NaN 992 7048.0 2017-01-01 00:05:00-05:00 Front St W / Yonge St (Hockey Hall of Fame) 7089.0 2017-01-01 00:22:00-05:00 Church St / Wood St ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
3 61 712444 NaN 1005 7177.0 2017-01-01 00:09:00-05:00 East Liberty St / Pirandello St 7202.0 2017-01-01 00:26:00-05:00 Queen St W / York St (City Hall) ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN
4 62 712445 NaN 645 7203.0 2017-01-01 00:14:00-05:00 Bathurst St/Queens Quay(Billy Bishop Airport) 7010.0 2017-01-01 00:25:00-05:00 King St W / Spadina Ave ... 1.5 -3.6 69.0 26.0 39.0 16.1 99.81 NaN NaN NaN

5 rows × 24 columns

Bike Share Trips Origin and Destination

Let's see which neighbourhoods have the highest departure and arrival counts in the bike share data.

Merging Datasets

First, let's use the Neighbourhoods file from the City of Toronto Open Data. The City has 140 designated neighbourhoods, each composed of a few census tracts.

We're using the geojson version since that is easier to work with than managing the multiple files needed for a shapefile.

In [108]:
neighbourhoods = gpd.read_file('Neighbourhoods.geojson')
neighbourhoods.head(5)
Out[108]:
_id AREA_ID AREA_ATTR_ID PARENT_AREA_ID AREA_SHORT_CODE AREA_LONG_CODE AREA_NAME AREA_DESC X Y LONGITUDE LATITUDE OBJECTID Shape__Area Shape__Length CLASSIFICATION CLASSIFICATION_CODE geometry
0 11201 2480141 26005521 None 96 96 Casa Loma (96) Casa Loma (96) None None None None 17545105 3.678385e+06 8214.176485 None None POLYGON ((-79.41469 43.67391, -79.41485 43.674...
1 11202 2480140 26005520 None 95 95 Annex (95) Annex (95) None None None None 17545121 5.337192e+06 10513.883143 None None POLYGON ((-79.39414 43.66872, -79.39588 43.668...
2 11203 2480139 26005519 None 109 109 Caledonia-Fairbank (109) Caledonia-Fairbank (109) None None None None 17545137 2.955857e+06 6849.911724 None None POLYGON ((-79.46021 43.68156, -79.46044 43.681...
3 11204 2480064 26005444 None 64 64 Woodbine Corridor (64) Woodbine Corridor (64) None None None None 17545153 3.052518e+06 7512.966773 None None POLYGON ((-79.31485 43.66674, -79.31660 43.666...
4 11205 2480063 26005443 None 103 103 Lawrence Park South (103) Lawrence Park South (103) None None None None 17545169 6.211341e+06 13530.370002 None None POLYGON ((-79.41096 43.70408, -79.41165 43.703...

In this step, I'll rename some columns and remove the neighbourhood id from the name string with a regex. We're essentially stripping the parenthesized one-, two-, or three-digit id from the end of the string.

We'll still keep the neighbourhood id, as it's better to use that as our unique identifier.

In [109]:
neighbourhoods = neighbourhoods.rename(columns = {'AREA_NAME':'name', 'AREA_SHORT_CODE':'area_short'})

neighbourhoods['name'] = neighbourhoods['name'].str.replace(r' \(\d{1,3}\)', '', regex = True) # strip the trailing ' (96)'-style id
neighbourhoods = neighbourhoods[['area_short','name', 'geometry']]

In a spreadsheet, I went ahead and manually classified each neighbourhood into a higher level of geography. I'm using the pre-1998 Metro Toronto Borough/City boundaries to classify neighbourhoods into North York, Scarborough, York, East York, and Etobicoke. This process was easy, since neighbourhood boundaries (and census tracts) line up with the former Metro Toronto boundaries, and the area_short parameter is numerically grouped into the former cities/boroughs.

I further subdivided the neighbourhoods located in the former Old Toronto. For Downtown, I used the TOcore definition of Downtown. From there, the East End is everything east of the Don River, the West End is generally west of Bathurst, and the North End is generally everything north of Bloor Street. The Annex and Wychwood neighbourhoods were placed in the North End, forming the boundary between the North and West Ends.

In [19]:
neighbourhoods = neighbourhoods.merge(pd.read_csv('neighbourhood_groupings.csv'))
neighbourhoods.head()
Out[19]:
area_short name geometry Borough
0 96 Casa Loma POLYGON ((-79.41469 43.67391, -79.41485 43.674... North End
1 95 Annex POLYGON ((-79.39414 43.66872, -79.39588 43.668... North End
2 109 Caledonia-Fairbank POLYGON ((-79.46021 43.68156, -79.46044 43.681... York
3 64 Woodbine Corridor POLYGON ((-79.31485 43.66674, -79.31660 43.666... East End
4 103 Lawrence Park South POLYGON ((-79.41096 43.70408, -79.41165 43.703... North End

To make the maps look more recognizable to the average person, and to keep distance calculations simple, we're going to project to UTM (a transverse Mercator projection). EPSG 26917 corresponds to UTM zone 17N, the zone covering Toronto.

In [20]:
neighbourhoods = neighbourhoods.to_crs(epsg = '26917')

We'll also rotate our geography by 17 degrees, so that Yonge Street runs vertically. This makes the map easier to read, and the same convention is frequently used in official city publications, such as TTC maps and official City of Toronto maps.

In [21]:
for index, row in neighbourhoods.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    neighbourhoods.at[index, 'geometry'] = rotated

Let's make a quick map of our borough groupings. I'll use the dissolve method for GeoDataFrames to group neighbourhoods by borough and dissolve their shared borders.

In [22]:
boroughs = neighbourhoods.dissolve(by = 'Borough')
city_boundary = gpd.geoseries.GeoSeries([neighbourhoods.unary_union]) # to create a single polygon representing Toronto

city_boundary.to_file('to_boundary.geojson', driver = 'GeoJSON')

I'm going to create a centroid column. This will make it easier to plot labels, as each label will be placed over its borough's centroid.

In [23]:
boroughs['centroid'] = boroughs['geometry'].apply(lambda x: x.centroid.coords[:])
boroughs['centroid'] = [i[0] for i in boroughs['centroid']]
boroughs = boroughs.reset_index()
In [65]:
from mpl_toolkits.axes_grid1.anchored_artists import AnchoredDirectionArrows # for north arrow

fig, ax = plt.subplots()
fig.set_size_inches(7,4)

boroughs.plot(cmap = 'tab10', ax = ax)

#north arrow
x, y, arrow_length = 0.85, 0.4, 0.15
ax.annotate('N', xy=(x + 0.15 * math.sin(math.radians(17)), y), xytext=(x, y- 0.15 * math.cos(math.radians(17))),
            arrowprops = dict(facecolor='black', width=1, headwidth=6),
            ha='center', va='center', fontsize=10,
            xycoords=ax.transAxes)


plt.suptitle('Boroughs used for the BikeShare Project', fontsize = 12, fontweight = 'bold', y = 0.94)
ax.set_title('Based on pre-1998 borders of Toronto', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_axis_off()

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25)) # scalebar

for index, row in boroughs.iterrows(): #iteratively plots labels
    plt.annotate(s=row['Borough'], xy=row['centroid'], ha = 'center', fontsize = 8)

Now let's join the bike share station locations to their neighbourhood and borough.

In [25]:
stations = pd.read_csv('bikeshare_stations.csv')

First we need to create a GeoDataFrame using the lat and lon found in the csv. Then we need to convert it to a UTM projection and rotate the geometry by 17 degrees.

In [26]:
stations_gdf = gpd.GeoDataFrame(stations, geometry = 
                                          gpd.points_from_xy(stations.lon, stations.lat)).set_crs(epsg = '4326').to_crs(epsg = '26917')

for index, row in stations_gdf.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    stations_gdf.at[index, 'geometry'] = rotated

To spatially join the datasets, we'll use Geopandas' sjoin function to return which neighbourhood a bike share station is in.

In [27]:
stations_gdf = gpd.sjoin(neighbourhoods, stations_gdf)

Analyzing Trip Origins

First, let's count the number of departures by start station. To reduce the complexity of the data, we can drop the other columns. We'll name the count column Trips.

In [28]:
trips_origin = trips_raw.groupby('Start Station Id').count()[['Trip Id']]
In [29]:
trips_origin = trips_origin.reset_index()
In [30]:
trips_origin = trips_origin.rename(columns = {'Trip Id': 'Trips'})
In [31]:
trips_origin.head()
Out[31]:
Start Station Id Trips
0 7000.0 54238
1 7001.0 28795
2 7002.0 42372
3 7003.0 26537
4 7004.0 23452

We'll left join the dataset to the lookup table with the neighbourhoods and station ids. Here we're doing two left joins: one with the GeoDataFrame of the stations, then one with the GeoDataFrame of the neighbourhoods, to make sure that when we plot the data, the neighbourhood geometry is used.

Then, we need to do another groupby aggregation (using the sum function) to sum up the trip origins by neighbourhood.

In [32]:
trips_origin_gdf = stations_gdf.merge(trips_origin, right_on = 'Start Station Id', left_on = 'Station Id', how = 'left')
In [33]:
trips_origin_gdf['Trips'] = trips_origin_gdf['Trips'].fillna(0)
In [34]:
trips_origin_gdf = neighbourhoods.merge(trips_origin_gdf.groupby('area_short').sum()[['Trips']].reset_index(), how = 'left')
In [35]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

cax = divider.append_axes("right", size="5%", pad=0.1) #to clean up the choropleth legend

trips_origin_gdf.plot(column = 'Trips', cmap = 'YlGnBu', ax = ax, legend = True, 
                      zorder = 1, edgecolor = 'k', linewidth = 0.2,
                     legend_kwds={'label': "Trips Since 2017"}, cax = cax, vmin = 0, vmax = 1800000)

city_boundary.plot(ax = ax, zorder = 0, color = 'xkcd:light grey') #any neighbourhood without a bike share trip will be grey

#north arrow
x, y, arrow_length = 0.85, 0.4, 0.15
ax.annotate('N', xy=(x + 0.15 * math.sin(math.radians(17)), y), xytext=(x, y- 0.15 * math.cos(math.radians(17))),
            arrowprops = dict(facecolor='black', width=1, headwidth=6),
            ha='center', va='center', fontsize=10,
            xycoords=ax.transAxes)

plt.suptitle('BikeShare origins are concentrated downtown', fontsize = 12, fontweight = 'bold', y = 0.93)

ax.set_title('Choropleth map of trip origins since 2017, by neighbourhood', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_axis_off()

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[35]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7fec2aab6bd0>

The data shows that trips are strongly concentrated downtown. Outside of downtown, there is some activity in neighbourhoods along the waterfront, specifically along the Martin Goodman Trail (such as South Parkdale and Humber Bay Shores).

Let's make a bar chart to find the names of the top 10 neighbourhoods by bikeshare trip origins.

In [36]:
fig, ax = plt.subplots()

fig.set_size_inches(7,4)

top_10_origins = trips_origin_gdf.sort_values(by = 'Trips', ascending = False).head(10)

sns.barplot(y = 'name', x = 'Trips', data = top_10_origins, ax = ax, orient = 'h', hue = 'Borough', dodge = False)

ax.set_yticks(range(10))
ax.set_xticks(range(0,1750000,250000))

ax.set_ylabel('Neighbourhood')
ax.set_xlabel('Trips')

ax.set_title('Number of trips since 2017, by neighbourhood', fontsize = 10)
plt.suptitle('Downtown is the biggest source of trips', fontweight = 'bold', fontsize = 12)

ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))

ax.set_yticklabels(top_10_origins['name'])

plt.show()

Again, a similar story to the map: the non-downtown neighbourhoods that show up on this list all border the lake.

What's not on this chart are other downtown neighbourhoods, such as St James Town, Cabbagetown, and Regent Park. This may suggest that income plays a factor in determining the number of bikeshare trips a neighbourhood generates.

Analyzing Trip Destinations

Now let's repeat all the same steps above, but for trip destinations. This means we'll group by the End Station Id before joining.

In [37]:
trips_dest = trips_raw.groupby('End Station Id').count()[['Trip Id']]
trips_dest = trips_dest.reset_index()
trips_dest = trips_dest.rename(columns = {'Trip Id': 'Trips'})

trips_dest_gdf = stations_gdf.merge(trips_dest, right_on = 'End Station Id', left_on = 'Station Id', how = 'left')
trips_dest_gdf = neighbourhoods.merge(trips_dest_gdf.groupby('area_short').sum()[['Trips']].reset_index(), how = 'left')
In [64]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

cax = divider.append_axes("right", size="5%", pad=0.1) #to clean up the choropleth legend

trips_dest_gdf.plot(column = 'Trips', cmap = 'YlGnBu', ax = ax, legend = True, 
                      zorder = 1, edgecolor = 'k', linewidth = 0.2,
                     legend_kwds={'label': "Trips Since 2017"}, cax = cax, vmin = 0, vmax = 1800000)

city_boundary.plot(ax = ax, zorder = 0, color = 'xkcd:light grey') #any neighbourhood without a bike share trip will be grey

#north arrow
x, y, arrow_length = 0.85, 0.4, 0.15
ax.annotate('N', xy=(x + 0.15 * math.sin(math.radians(17)), y), xytext=(x, y- 0.15 * math.cos(math.radians(17))),
            arrowprops = dict(facecolor='black', width=1, headwidth=6),
            ha='center', va='center', fontsize=10,
            xycoords=ax.transAxes)

plt.suptitle('BikeShare destinations are also concentrated downtown', fontsize = 12, fontweight = 'bold', y = 0.93)

ax.set_title('Choropleth map of trip destinations since 2017, by neighbourhood', fontsize = 8)

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[64]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7febbc0e5890>
In [39]:
fig, ax = plt.subplots()

fig.set_size_inches(7,4)

top_10_origins = trips_dest_gdf.sort_values(by = 'Trips', ascending = False).head(10)

sns.barplot(y = 'name', x = 'Trips', data = top_10_origins, ax = ax, orient = 'h', hue = 'Borough', dodge = False)

ax.set_yticks(range(10))
ax.set_xticks(range(0,2000000,250000))

ax.set_ylabel('Neighbourhood')
ax.set_xlabel('Trips')

ax.set_title('Number of trips since 2017, by neighbourhood', fontsize = 10)
plt.suptitle('A similar concentration in downtown exists for destinations', fontweight = 'bold', fontsize = 12)

ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))

ax.set_yticklabels(top_10_origins['name'])

plt.show()

Now let's see if there are any imbalances between the origins and destinations. To do this, we'll merge the origin and destination tables together, then compute the relative difference of the trips:

$ (origins - destinations)/(origins)*100\% $

A negative value indicates the neighbourhood receives more arriving trips than it generates, while a positive value indicates more origins than destinations.
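
As a quick sanity check of the formula, here is a toy calculation with made-up counts (not taken from the actual data):

```python
# Hypothetical neighbourhood counts (invented for illustration)
origins, destinations = 120_000, 90_000

# Relative difference: positive means more departures than arrivals
delta_perc = (origins - destinations) / origins * 100

print(f"{delta_perc:.1f}%")  # → 25.0%
```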

In [40]:
trips_neighbourhood = trips_origin_gdf.merge(trips_dest_gdf, on = ['area_short', 'name', 'geometry', 'Borough'], 
                       suffixes = ['_origin', '_destination'])
In [41]:
trips_neighbourhood['delta_perc'] = (trips_neighbourhood['Trips_origin'] - trips_neighbourhood['Trips_destination'])/trips_neighbourhood['Trips_origin']
In [42]:
trips_neighbourhood['delta_perc'].min(), trips_neighbourhood['delta_perc'].max()
Out[42]:
(-0.10467111120762666, 0.3214904679376083)
In [43]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

cax = divider.append_axes("right", size="5%", pad=0.1) #to clean up the choropleth legend

trips_neighbourhood.plot(column = 'delta_perc', cmap = 'PiYG', ax = ax, legend = True, 
                      zorder = 1, edgecolor = 'k', linewidth = 0.2,
                     legend_kwds={'label': "Relative Difference Fraction"}, cax = cax, vmin = -0.3, vmax = 0.3)

city_boundary.plot(ax = ax, zorder = 0, color = 'xkcd:light grey') #any neighbourhood without a bike share trip will be grey

#north arrow
x, y, arrow_length = 0.85, 0.4, 0.15
ax.annotate('N', xy=(x + 0.15 * math.sin(math.radians(17)), y), xytext=(x, y- 0.15 * math.cos(math.radians(17))),
            arrowprops = dict(facecolor='black', width=1, headwidth=6),
            ha='center', va='center', fontsize=10,
            xycoords=ax.transAxes)

plt.suptitle('Users starting in Midtown are likely not making return trips', fontsize = 12, fontweight = 'bold', y = 0.93)

ax.set_title('Relative difference of origins to destinations, by neighbourhood', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_axis_off()

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[43]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7feceb8323d0>

From the map, we can see that North End/Midtown neighbourhoods have more origins than destinations. We can reasonably infer that users starting in the North End are travelling south to downtown, the lakeshore, or the beaches, and do not want to bike back uphill.

The topography of the city clearly plays a role in Bikeshare trips.

Analyzing Bike Share Flows

Now that we know the origins and destinations of Bike Share trips, let's see how trips flow between them. To avoid a 140x140 matrix, we'll use the borough definitions from earlier.

Let's merge the trips dataset with stations_gdf to assign the borough each trip starts in.

In [44]:
trips_chord = trips_raw.merge(stations_gdf, left_on = 'Start Station Id', right_on = 'Station Id')
trips_chord = trips_chord.rename(columns = {'area_short':'origin_area_short', 'Borough': 'Start Borough'})
trips_chord = trips_chord[['origin_area_short', 'Trip Id', 'End Station Id', 'Start Borough']]

trips_origin_gdf['Trips'] = trips_origin_gdf['Trips'].fillna(0)

We'll also repeat the process for the destinations.

In [45]:
trips_chord = trips_chord.merge(stations_gdf, left_on = 'End Station Id', right_on = 'Station Id')
trips_chord = trips_chord.rename(columns = {'area_short':'dest_area_short', 'Borough': 'End Borough'})
trips_chord = trips_chord[['Trip Id', 'Start Borough',  'End Borough']]
trips_chord.head()
Out[45]:
Trip Id Start Borough End Borough
0 712441 Downtown Downtown
1 717862 Downtown Downtown
2 719500 Downtown Downtown
3 722310 Downtown Downtown
4 723078 Downtown Downtown

Now let's perform a groupby action, using count() as the aggregation method.

In [46]:
trips_chord = trips_chord.groupby(['Start Borough', 'End Borough']).count()
In [47]:
trips_chord = trips_chord.reset_index()
In [48]:
trips_chord = trips_chord.rename(columns = {'Trip Id': 'Trips'})

To convert to an origin-destination matrix, we'll use the pivot function found in pandas.

Traditionally in transportation, origins are represented by rows, while destinations are columns.

In [49]:
trips_matrix = trips_chord.pivot(index = 'Start Borough', columns = 'End Borough', values = 'Trips').fillna(0)
trips_matrix
Out[49]:
End Borough Downtown East End East York Etobicoke North End North York Scarborough West End York
Start Borough
Downtown 4238228.0 212746.0 9488.0 2087.0 202108.0 13.0 25264.0 666509.0 180.0
East End 194954.0 198633.0 13396.0 0.0 11285.0 47.0 2960.0 7227.0 2.0
East York 11756.0 13832.0 12461.0 0.0 3096.0 1234.0 248.0 184.0 1.0
Etobicoke 2294.0 1.0 0.0 34119.0 3.0 0.0 3.0 49291.0 958.0
North End 256020.0 11515.0 3369.0 4.0 142728.0 362.0 216.0 72214.0 1219.0
North York 17.0 18.0 1432.0 0.0 282.0 10955.0 1.0 0.0 4.0
Scarborough 22911.0 3573.0 228.0 4.0 113.0 27.0 5345.0 2690.0 0.0
West End 674135.0 7698.0 107.0 51523.0 51185.0 0.0 3312.0 791293.0 2756.0
York 526.0 3.0 3.0 963.0 1379.0 1.0 0.0 3026.0 1323.0

To make a chord diagram, we'll install a package that uses matplotlib to draw chord diagrams easily.

As an initial visualization, chord diagrams can convey flows more easily than a heatmap, which needs annotations to be useful and may be overwhelming for non-technical audiences.

In [50]:
!pip install mpl-chord-diagram
Requirement already satisfied: mpl-chord-diagram in /Users/Rick/opt/anaconda3/lib/python3.7/site-packages (0.2.0)
In [51]:
from mpl_chord_diagram import chord_diagram
In [52]:
#manually defining names so that certain boroughs are on 2 lines.

names = ['Downtown', 'East\nEnd', 'East\nYork', 'Etobicoke', 'North\nEnd',
       'North\nYork', 'Scarborough', 'West\nEnd', 'York'] 
In [53]:
fig, ax = plt.subplots()

plt.suptitle('Most Trips Are Within Downtown, or Old Toronto', x = 0.249, y = 0.99, ha = 'left', fontsize = 10, fontweight = 'bold')
ax.text(s = 'Chord diagram of origin and destination of BikeshareTO trips, by Borough', x = 0.03, y = 1.6, ha = 'center', fontsize = 6)

chord = chord_diagram(trips_matrix, names = names, cmap = 'Set1', ax = ax, fontsize = 8, rotate_names = 90,
                      order = [0,2,1,3,4,5,7,6,8], pad = 4, use_gradient =  True)

Here, we can easily see that Downtown-to-Downtown trips make up close to half of all trips. The other Old Toronto boroughs have a significant amount of trips, and roughly half of their trips go to or come from downtown. Inter-borough trips are also a significant source of trips for all boroughs.

Of the remaining boroughs, Etobicoke (again likely due to the Martin Goodman Trail) looks to be the busiest, while York and North York are at the bottom for trips; Scarborough, by comparison, still sees a fair amount of activity.

Another way to visualize the flow between regions of Toronto is a heat map. We'll convert the data in the matrix to ratios/percentages, then use sns.heatmap to make the graph.

In [54]:
order = ['Downtown',  'West End', 'East End', 'North End', 'East York', 'Etobicoke',
       'North York', 'Scarborough', 'York']
In [55]:
trips_matrix = trips_matrix.reindex(index=order, columns=order)
In [56]:
trips_matrix_perc = trips_matrix/trips_matrix.sum().sum()
In [57]:
ax = sns.heatmap(trips_matrix_perc, fmt = '.1%', annot = True, cmap="YlGnBu", annot_kws={"fontsize":8})

ax.text(s = 'Intra-Downtown trips constitute half of all trips', fontweight = 'bold', y = -0.70, x = 0, ha = 'left')
ax.text(s = 'Heatmap representation of BikeShare Toronto origin-destination matrix, by borough', fontsize = 7, y = -0.25, x = 0, ha = 'left')
Out[57]:
Text(0, -0.25, 'Heatmap representation of BikeShare Toronto origin-destination matrix, by borough')

The advantage of the heat map is that we can clearly see the numeric values the chord diagram represents. Downtown dominates the trip counts so heavily that, in order to visualize the other boroughs, we'll take out intra-downtown trips, recalculate the ratios, and make a new heatmap.

In [58]:
trips_matrix_mod = trips_matrix.copy() # copy, so the original matrix is not modified in place
trips_matrix_mod.iloc[0,0] = None
In [59]:
trips_matrix_perc_mod = trips_matrix_mod/trips_matrix_mod.sum().sum()
In [60]:
ax = sns.heatmap(trips_matrix_perc_mod, fmt = '.1%', annot = True, cmap="YlGnBu", annot_kws={"fontsize":8})

ax.text(s = 'West End Toronto is another center of BikeShare trips', fontweight = 'bold', y = -0.70, x = 0, ha = 'left')
ax.text(s = 'Heatmap representation of BikeShare Toronto origin-destination matrix, by borough', fontsize = 7, y = -0.25, x = 0, ha = 'left')
Out[60]:
Text(0, -0.25, 'Heatmap representation of BikeShare Toronto origin-destination matrix, by borough')

Similar to the chord diagram, we see that trips within Old Toronto capture the rest of the BikeShare activity, and there is very limited activity in the suburbs. Of the Old Toronto areas, West End neighbourhoods have the highest activity, which could be influenced by the Martin Goodman Trail, the presence of the Bloor bike lane, and generally denser bike lane infrastructure in the West End.

Routing Trips

Finding the Busiest Road Segment by Routing Trips

One question that we would like to answer is: which roads are bikeshare users riding on?

There are two straightforward ways to analyze this:

  1. We connect the origin and destination of each trip with a straight line, then construct a kernel density map. We can then spatially interpolate the "trip activity" each road gets, and convert that to raw trips.
  2. We can use a shortest path algorithm to route trips through the road network, for each origin-destination pair.

Method 2 would produce a more realistic, less abstract result, so we'll use this method to find the busiest road segments for bikeshare users.

Introduction to networkx

We can use a package called networkx. networkx is built on graph (network) theory, which represents a network as a set of nodes connected by links. We can then use a built-in shortest path algorithm to find the path a trip takes through the network.

Below is an example graph that has been randomly created.

In [393]:
# Create a random graph with 10 nodes, where each possible edge exists with probability 0.4 (seed 0)
test = nx.fast_gnp_random_graph(10, 0.4, 0)

nx.draw(test)
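
Since we'll route trips by distance rather than hop count, here is a minimal sketch, using an invented three-node network, of how a `length` weight changes the path networkx chooses:

```python
import networkx as nx

# Toy network: nodes are intersections, edges are road segments
# with a 'length' attribute in metres (values are invented).
G = nx.Graph()
G.add_edge('A', 'B', length=100)
G.add_edge('B', 'C', length=100)
G.add_edge('A', 'C', length=250)  # direct link, but longer

# Without a weight, the direct A-C hop wins; with weight='length',
# Dijkstra's algorithm routes via B (200 m < 250 m).
print(nx.shortest_path(G, 'A', 'C'))                   # → ['A', 'C']
print(nx.shortest_path(G, 'A', 'C', weight='length'))  # → ['A', 'B', 'C']
```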

For our application, the edges/links are the streets/trails, and the nodes are the intersections.

The City of Toronto has an open data spatial layer, the centreline, that contains the individual road segments of the city. Each segment is approximately 100 m long, although this varies greatly, since it depends on where intersections fall.

The centreline layer is linked directly to the intersections layer that the City of Toronto also hosts on its open data portal. The FNODE and TNODE columns refer to the INT_ID of the intersections.

In [395]:
centreline = gpd.read_file('centreline.geojson')
intersections = gpd.read_file('intersections/CENTRELINE_INTERSECTION_WGS84.shp')
In [396]:
centreline.head()
Out[396]:
GEO_ID LFN_ID LF_NAME ADDRESS_L ADDRESS_R OE_FLAG_L OE_FLAG_R LONUML HINUML LONUMR HINUMR FNODE TNODE FCODE FCODE_DESC JURIS_CODE OBJECTID SHAPE_LENGTH geometry
0 914632 2273 Phlox Ave 2-20 3-21 E O 2.0 20.0 3.0 21.0 13470579 13470553 201500 Local CITY OF TORONTO 31 96.539575 LINESTRING (-79.52648 43.59696, -79.52686 43.5...
1 914590 1532 Elder Ave 91-91 None O N 91.0 91.0 NaN NaN 13470547 13470554 201400 Collector CITY OF TORONTO 40 98.707488 LINESTRING (-79.52966 43.59804, -79.53082 43.5...
2 9109246 2617 Thirty Second St 58-70 57-75 E O 58.0 70.0 57.0 75.0 13470563 13470545 201500 Local CITY OF TORONTO 41 86.709803 LINESTRING (-79.53205 43.59747, -79.53239 43.5...
3 914605 1532 Elder Ave None None N N NaN NaN NaN NaN 13470554 13470563 201500 Local CITY OF TORONTO 42 104.174108 LINESTRING (-79.53082 43.59776, -79.53205 43.5...
4 9109255 2617 Thirty Second St 2-56 1-55 E O 2.0 56.0 1.0 55.0 13470618 13470563 201500 Local CITY OF TORONTO 43 281.757798 LINESTRING (-79.53099 43.59506, -79.53205 43.5...

For our analysis, we will use the following columns from the centreline:

  • GEO_ID - the unique identifier for a segment
  • LFN_ID and LF_NAME - the identifier and string name of the street
  • FNODE and TNODE - the INT_ID that corresponds to the intersections in the intersections layer
  • FCODE_DESC - the road classification
  • geometry
In [297]:
centreline = centreline[['GEO_ID', 'LFN_ID', 'LF_NAME', 'FNODE', 'TNODE', 'FCODE_DESC', 'geometry']]

For all the spatial layers, I'll convert them to the UTM Zone 17N projection and rotate them by 17 degrees so Toronto sits square with the map boundary.

In [401]:
centreline = centreline.to_crs(epsg = '26917')
for index, row in centreline.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    centreline.at[index, 'geometry'] = rotated

By default, a shortest path algorithm counts the number of links, but we can specify a different weight. We'll calculate a length column (in metres) for networkx to use in its shortest path calculation.

In [299]:
centreline['length'] = centreline.length
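
With FNODE, TNODE, and the new length column, the centreline table can be turned into a networkx graph via `from_pandas_edgelist`. A sketch, with a fabricated three-row edge table standing in for the real centreline:

```python
import networkx as nx
import pandas as pd

# Fabricated stand-in for the centreline table: FNODE/TNODE are
# intersection ids, 'length' is the segment length in metres.
edges = pd.DataFrame({
    'FNODE':  [1, 2, 1],
    'TNODE':  [2, 3, 3],
    'length': [96.5, 104.2, 250.0],
})

# Each row becomes an edge; 'length' is stored as an edge attribute
# so shortest-path queries can minimize distance travelled.
G = nx.from_pandas_edgelist(edges, source='FNODE', target='TNODE',
                            edge_attr='length')

print(nx.shortest_path(G, 1, 3, weight='length'))  # → [1, 2, 3]
```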

Many different road classes exist in the centreline layer.

In [300]:
centreline['FCODE_DESC'].unique()
Out[300]:
array(['Local', 'Collector', 'River', 'Geostatistical line',
       'Major Shoreline', 'Major Arterial', 'Major Railway',
       'Minor Arterial', 'Expressway Ramp', 'Expressway',
       'Major Arterial Ramp', 'Hydro Line', 'Laneway', 'Walkway', 'Trail',
       'Other', 'Pending', 'Access Road', 'Minor Shoreline (Land locked)',
       'Ferry Route', 'Busway', 'Collector Ramp', 'Creek/Tributary',
       'Other Ramp', 'Minor Arterial Ramp', 'Minor Railway'], dtype=object)

We obviously don't want to accidentally route trips over a path that would take them onto the Gardiner Expressway. The centreline layer also maps the "centre lines" of other features, such as pre-1998 municipal boundaries, rivers, railways, former roads, and hydro lines. We'll need to subset the centreline to only the links that bikeshare users can physically ride on.

Note: Much of my knowledge on city provided spatial layers comes from the fact that I worked at the City of Toronto during PEY.

In [301]:
valid_roads = ['Local', 'Collector', 'Major Arterial',
               'Minor Arterial', 'Trail', 'Walkway']
In [302]:
centreline_valid = centreline[centreline['FCODE_DESC'].isin(valid_roads)]
centreline_valid 
Out[302]:
GEO_ID LFN_ID LF_NAME FNODE TNODE FCODE_DESC geometry length
0 914632 2273 Phlox Ave 13470579 13470553 Local LINESTRING (2003509.729 4436237.822, 2003505.4... 96.527457
1 914590 1532 Elder Ave 13470547 13470554 Collector LINESTRING (2003295.705 4436423.294, 2003197.0... 98.694061
2 9109246 2617 Thirty Second St 13470563 13470545 Local LINESTRING (2003092.979 4436416.065, 2003089.1... 86.698715
3 914605 1532 Elder Ave 13470554 13470563 Local LINESTRING (2003197.079 4436419.637, 2003092.9... 104.161582
4 9109255 2617 Thirty Second St 13470618 13470563 Local LINESTRING (2003101.344 4436134.467, 2003092.9... 281.722130
... ... ... ... ... ... ... ... ...
70194 104377 9508 Royal Albert Cres 13442765 13442745 Local LINESTRING (2027228.496 4453765.649, 2027247.9... 49.209075
70195 109441 8891 Scarborough Golf Club Rd 13449063 13448879 Minor Arterial LINESTRING (2032598.741 4446321.793, 2032593.2... 164.952905
70196 30064466 30032 Sewells Rd 13441591 30064465 Local LINESTRING (2035673.092 4455681.813, 2035667.7... 116.564957
70197 170 161 Glenbrae Ave 13455519 13455161 Local LINESTRING (2019500.077 4445421.439, 2019498.3... 258.928616
70199 440829 1610 Finch Ave W 13449808 13449874 Major Arterial LINESTRING (2008319.950 4453156.785, 2008183.4... 136.519816

53137 rows × 8 columns

While we're at it, let's also load the bikeways layer, also from Toronto Open Data.

In [303]:
bikeways = gpd.read_file('bikeways.geojson')
bikeways = bikeways.to_crs(epsg = '26917')
for index, row in bikeways.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    bikeways.at[index, 'geometry'] = rotated
In [304]:
bikeways.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1262 entries, 0 to 1261
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   _id                  1262 non-null   int64   
 1   OBJECTID             1262 non-null   int64   
 2   SEGMENT_ID           1204 non-null   float64 
 3   INSTALLED            1262 non-null   int64   
 4   UPGRADED             1211 non-null   float64 
 5   PRE_AMALGAMATION     1262 non-null   object  
 6   STREET_NAME          1250 non-null   object  
 7   FROM_STREET          1248 non-null   object  
 8   TO_STREET            1247 non-null   object  
 9   ROADCLASS            1242 non-null   object  
 10  CNPCLASS             1208 non-null   object  
 11  SURFACE              1220 non-null   object  
 12  OWNER                889 non-null    object  
 13  DIR_LOWORDER         1242 non-null   object  
 14  INFRA_LOWORDER       1260 non-null   object  
 15  SEPA_LOWORDER        205 non-null    object  
 16  SEPB_LOWORDER        7 non-null      object  
 17  ORIG_LOWORDER_INFRA  240 non-null    object  
 18  DIR_HIGHORDER        1180 non-null   object  
 19  INFRA_HIGHORDER      1182 non-null   object  
 20  SEPA_HIGHORDER       203 non-null    object  
 21  SEPB_HIGHORDER       6 non-null      object  
 22  ORIG_HIGHORDER       222 non-null    object  
 23  BYLAWED              224 non-null    object  
 24  LAST_EDIT_DATE       1262 non-null   object  
 25  UPGRADE_DESCRIPTION  65 non-null     object  
 26  Shape__Length        1262 non-null   float64 
 27  geometry             1262 non-null   geometry
dtypes: float64(3), geometry(1), int64(3), object(21)
memory usage: 276.2+ KB

Four columns appear to describe the level of bike infrastructure a path has; however, they don't always agree. My sense is that the bikeways layer was created from the centreline layer by merging road segments into whole roads. That merge combined segments with different bike infrastructure, and these four columns are the byproduct.

From manual inspection, the INFRA_HIGHORDER column seems to be the most reliable, so we'll use it as the column that tells us the bike infrastructure a link has.

In [305]:
bikeways[['ORIG_LOWORDER_INFRA', 'INFRA_HIGHORDER','INFRA_LOWORDER', 'ORIG_HIGHORDER']]
Out[305]:
ORIG_LOWORDER_INFRA INFRA_HIGHORDER INFRA_LOWORDER ORIG_HIGHORDER
0 Sharrows Bike Lane Bike Lane Sharrows
1 Multi-Use Trail Multi-Use Trail MUT (2016 Network Plan/2012 Trails Plan) Multi-Use Trail
2 Multi-Use Trail Multi-Use Trail MUT (2016 Network Plan/2012 Trails Plan) Multi-Use Trail
3 Multi-Use Trail Multi-Use Trail MUT (2016 Network Plan/2012 Trails Plan) Multi-Use Trail
4 Multi-Use Trail Multi-Use Trail - Entrance MUT - Entrance (2016 Network Plan/2012 Trails ... Multi-Use Trail
... ... ... ... ...
1257 None Bike Lane Signed Route (No Pavement Markings) None
1258 None Signed Route (No Pavement Markings) Signed Route (No Pavement Markings) None
1259 Signed Route (No Pavement Markings) Sharrows Sharrows Signed Route (No Pavement Markings)
1260 Signed Route (No Pavement Markings) Bike Lane Sharrows Signed Route (No Pavement Markings)
1261 None MUT - Entrance (2016 Network Plan/2012 Trails ... MUT - Entrance (2016 Network Plan/2012 Trails ... None

1262 rows × 4 columns

To combine the centreline and bikeways layer, we'll have to follow a couple of steps:

  1. Buffer each bikeway in the bikeways layer by 10m
  2. Use a spatial join with the contains operation on the centreline. This captures every centreline segment that falls entirely within a buffer and assigns it to that bikeway, producing a lookup table
  3. Use that lookup table to assign bike infrastructure levels to each road segment (if available)
In [306]:
bikeways_buffer = bikeways.copy()  # copy so buffering doesn't overwrite the original geometry
bikeways_buffer['geometry'] = bikeways_buffer.buffer(10)
In [307]:
bikeway_lookup = gpd.sjoin(bikeways_buffer, centreline_valid, how = 'left', op = 'contains')[['GEO_ID','INFRA_HIGHORDER']]
In [308]:
centreline_valid = centreline_valid.merge(bikeway_lookup, how = 'left')
centreline_valid
Out[308]:
GEO_ID LFN_ID LF_NAME FNODE TNODE FCODE_DESC geometry length INFRA_HIGHORDER
0 914632 2273 Phlox Ave 13470579 13470553 Local LINESTRING (2003509.729 4436237.822, 2003505.4... 96.527457 NaN
1 914590 1532 Elder Ave 13470547 13470554 Collector LINESTRING (2003295.705 4436423.294, 2003197.0... 98.694061 Signed Route (No Pavement Markings)
2 9109246 2617 Thirty Second St 13470563 13470545 Local LINESTRING (2003092.979 4436416.065, 2003089.1... 86.698715 NaN
3 914605 1532 Elder Ave 13470554 13470563 Local LINESTRING (2003197.079 4436419.637, 2003092.9... 104.161582 NaN
4 9109255 2617 Thirty Second St 13470618 13470563 Local LINESTRING (2003101.344 4436134.467, 2003092.9... 281.722130 NaN
... ... ... ... ... ... ... ... ... ...
53382 104377 9508 Royal Albert Cres 13442765 13442745 Local LINESTRING (2027228.496 4453765.649, 2027247.9... 49.209075 NaN
53383 109441 8891 Scarborough Golf Club Rd 13449063 13448879 Minor Arterial LINESTRING (2032598.741 4446321.793, 2032593.2... 164.952905 NaN
53384 30064466 30032 Sewells Rd 13441591 30064465 Local LINESTRING (2035673.092 4455681.813, 2035667.7... 116.564957 NaN
53385 170 161 Glenbrae Ave 13455519 13455161 Local LINESTRING (2019500.077 4445421.439, 2019498.3... 258.928616 NaN
53386 440829 1610 Finch Ave W 13449808 13449874 Major Arterial LINESTRING (2008319.950 4453156.785, 2008183.4... 136.519816 NaN

53387 rows × 9 columns

The entries in INFRA_HIGHORDER are not standardized, and many labels are near-duplicates of each other. We'll need to do some manual cleaning of this column.

In [309]:
centreline_valid['INFRA_HIGHORDER'].unique()
Out[309]:
array([nan, 'Signed Route (No Pavement Markings)', 'Bike Lane',
       'Sharrows - Wayfinding', 'Multi-Use Trail - Entrance',
       'Multi-Use Trail', 'Park Road', None,
       'MUT (2016 Network Plan/2012 Trails Plan)',
       'Multi-Use Trail - Boulevard', 'Cycle Track',
       'MUT - Boulevard (2016 Network Plan/2012 Trails Plan)',
       'Bike Lane - Buffered', 'Sharrows - Arterial',
       'Contraflow Bike Lane', 'Sharrows', 'Edge Line',
       'Bike Lane - Contraflow', 'Contraflow Cycle Track',
       'Cycle Track - Contraflow', 'Multi-Use Trail (Local)',
       'MUT - Entrance (2016 Network Plan/2012 Trails Plan)'],
      dtype=object)
In [310]:
centreline_valid['FCODE_DESC'].unique()
Out[310]:
array(['Local', 'Collector', 'Major Arterial', 'Minor Arterial',
       'Walkway', 'Trail'], dtype=object)

Route preference of cyclists

An additional complication: while it may be tempting to assume that all bikeshare users stick to bikeways, this is obviously not the case. They may choose to ride on regular arterials if they're comfortable doing so.

Cyclists also differ in their preferences. Most will choose a cycle track (which the city defines as a bike lane with physical separation) over a road with sharrows, unless it means a huge detour.

We'll use the concept of generalized cost to route our trips. Instead of finding the shortest path by length, we'll find the path with the least cost (or greatest utility). A path on a bike lane costs less, and brings a cyclist greater utility, than a path on a busy arterial.

We'll use the length of a road segment as the starting point for its cost, then scale that length by how willing cyclists are to choose each type of road over another when planning a route. This means cyclists will take a detour if it lets them ride on a trail or a bike lane, but only up to a threshold, and will use an arterial if it's much more direct than the alternatives.
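The effect can be illustrated with a toy graph (the node names, lengths, and factors below are made up for illustration): a route that is 50% longer on the ground can still win once the factors are applied.

```python
import networkx as nx

# Two routes from A to D: A-B-D is shorter but on a busy arterial
# (factor 1.5); A-C-D is longer but on a trail (factor 0.5).
toy = nx.Graph()
toy.add_edge('A', 'B', length=400, length_adj=400 * 1.5)
toy.add_edge('B', 'D', length=400, length_adj=400 * 1.5)
toy.add_edge('A', 'C', length=600, length_adj=600 * 0.5)
toy.add_edge('C', 'D', length=600, length_adj=600 * 0.5)

print(nx.shortest_path(toy, 'A', 'D', weight='length'))      # ['A', 'B', 'D'] - shortest distance
print(nx.shortest_path(toy, 'A', 'D', weight='length_adj'))  # ['A', 'C', 'D'] - least generalized cost
```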

One study out of UBC surveyed riders on their route preferences across different types of bike infrastructure.

In [399]:
from IPython.display import Image
Image('route_preference.png')
Out[399]:

We'll use a modified version of this table to scale our lengths in the length_fac column. For each type of road, we'll set the factor to:

$\mathrm{factor} = 1 - \mathrm{score}$

We'll use a series of np.where statements to set the factor wherever each condition is met.

In [311]:
centreline_valid['length_fac'] = None
In [312]:
# multiuse paths
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Multi-Use Trail - Entrance',
                                                                                   'Multi-Use Trail','Park Road',
                                                                                   'Multi-Use Trail - Boulevard',
                                                                                   'MUT (2016 Network Plan/2012 Trails Plan)',
                                                                                    'Multi-Use Trail (Local)',
                                                                                    'MUT - Entrance (2016 Network Plan/2012 Trails Plan)']) 
                                          , 0.5, centreline_valid['length_fac'])
In [313]:
# catching cases where the Martin Goodman Trail bikeway does not exactly overlap the centreline
centreline_valid['length_fac'] = np.where(centreline_valid['FCODE_DESC'].isin(['Trail']) 
                                          , 0.5, centreline_valid['length_fac']) 
In [314]:
# all separated bike paths or cycle tracks
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Contraflow Cycle Track',
                                                                                   'Cycle Track - Contraflow',
                                                                                   'Cycle Track']) 
                                          , 0.6, centreline_valid['length_fac'])
In [315]:
# residential roads
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isnull()
                                          & centreline_valid['FCODE_DESC'].isin(['Local', 'Collector', 'Walkway']),
                                          0.9, centreline_valid['length_fac'])
In [316]:
# residential roads signed as bike routes
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Signed Route (No Pavement Markings)',
                                                                                   'Sharrows - Wayfinding',
                                                                                   'Sharrows']) 
                                          & centreline_valid['FCODE_DESC'].isin(['Local', 'Collector']), 0.7, centreline_valid['length_fac'])
In [317]:
# residential bike lanes
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Bike Lane', 
                                                                                    'Bike Lane - Buffered', 
                                                                                    'Contraflow Bike Lane', 
                                                                                   'Bike Lane - Contraflow']) 
                                          & centreline_valid['FCODE_DESC'].isin(['Local', 'Collector']), 0.6, centreline_valid['length_fac'])
In [318]:
# arterials or major streets
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isnull()
                                          & centreline_valid['FCODE_DESC'].isin(['Major Arterial', 'Minor Arterial']),
                                          1.5, centreline_valid['length_fac'])
In [319]:
# bike lanes on arterials (but not cycle tracks/separated bike lanes)
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Bike Lane', 
                                                                                    'Bike Lane - Buffered', 
                                                                                    'Contraflow Bike Lane', 
                                                                                   'Bike Lane - Contraflow'])
                                          & centreline_valid['FCODE_DESC'].isin(['Major Arterial', 'Minor Arterial']),
                                          0.9, centreline_valid['length_fac'])
In [320]:
# arterials signed as bike routes
centreline_valid['length_fac'] = np.where(centreline_valid['INFRA_HIGHORDER'].isin(['Signed Route (No Pavement Markings)',
                                                                                   'Sharrows - Wayfinding',
                                                                                   'Sharrows', 'Sharrows - Arterial',
                                                                                   'Edge Line'])
                                          & centreline_valid['FCODE_DESC'].isin(['Major Arterial', 'Minor Arterial']),
                                          1.2, centreline_valid['length_fac'])
In [321]:
# the Lake Shore trail was not properly matched in the lookup table, so we'll make a manual adjustment
centreline_valid['length_fac'] = np.where(centreline_valid['LF_NAME'] == 'Lake Shore Blvd East Trl',
                                          0.5, centreline_valid['length_fac'])
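As an aside, a chain of np.where calls like the above can be collapsed into a single np.select. One caution: np.select takes the first matching condition, whereas in an np.where chain the last write wins, so the conditions may need reordering. A sketch with a hypothetical mini-frame and a subset of the categories:

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering three of the cases above
df = pd.DataFrame({
    'INFRA_HIGHORDER': ['Cycle Track', None, 'Bike Lane'],
    'FCODE_DESC': ['Major Arterial', 'Local', 'Collector'],
})

conditions = [
    df['INFRA_HIGHORDER'].isin(['Cycle Track', 'Contraflow Cycle Track', 'Cycle Track - Contraflow']),
    df['INFRA_HIGHORDER'].isin(['Bike Lane', 'Bike Lane - Buffered']) & df['FCODE_DESC'].isin(['Local', 'Collector']),
    df['INFRA_HIGHORDER'].isnull() & df['FCODE_DESC'].isin(['Local', 'Collector', 'Walkway']),
]
choices = [0.6, 0.6, 0.9]  # first matching condition wins
df['length_fac'] = np.select(conditions, choices, default=np.nan)
print(df['length_fac'].tolist())  # [0.6, 0.9, 0.6]
```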

The length_adj column represents the cost to the cyclist of taking a path. We'll be using this column instead of length in networkx.

In [322]:
centreline_valid['length_adj'] = centreline_valid['length'] * centreline_valid['length_fac'] 

Preparing the Nodes

In [323]:
intersections.head()
Out[323]:
INT_ID ELEV_ID INTERSEC5 CLASSIFI6 CLASSIFI7 NUM_ELEV ELEVATIO9 ELEVATIO10 ELEV_LEVEL ELEVATION ELEVATIO13 HEIGHT_R14 HEIGHT_R15 X Y LONGITUDE LATITUDE OBJECTID geometry
0 13469938 20548 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass 2 509200 Pseudo 0 0.0 None 0.0 None 302703.415 4830674.977 -79.525779 43.618007 41.0 POINT (-79.52578 43.61801)
1 13469938 29774 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass 2 509200 Pseudo 1 0.0 None 0.0 None 302703.415 4830674.977 -79.525779 43.618007 42.0 POINT (-79.52578 43.61801)
2 13469952 9472 Kipling Ave / Kipling S Gardiner C E Ramp MNRML Minor-Multi Level 2 501300 Minor 1 0.0 None 0.0 None 302713.398 4830626.325 -79.525656 43.617569 39.0 POINT (-79.52566 43.61757)
3 13469952 28079 Kipling Ave / Kipling S Gardiner C E Ramp MNRML Minor-Multi Level 2 509200 Pseudo 0 0.0 None 0.0 None 302713.398 4830626.325 -79.525656 43.617569 40.0 POINT (-79.52566 43.61757)
4 13469946 32748 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass 2 509200 Pseudo 0 0.0 None 0.0 None 302711.225 4830637.839 -79.525683 43.617673 45.0 POINT (-79.52568 43.61767)

We'll be using these columns from the intersections file:

  • INT_ID and INTERSEC5 - to identify the intersection and to link it to the centreline
  • CLASSIFI6 and CLASSIFI7 - the classification of intersection
  • geometry
In [324]:
intersections = intersections[['INT_ID', 'INTERSEC5', 'CLASSIFI6', 'CLASSIFI7', 'geometry']]
In [325]:
intersections = intersections.to_crs(epsg = '26917')
for index, row in intersections.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    intersections.at[index, 'geometry'] = rotated

While we aren't using these classifications, it's interesting to note that, like the centreline, the intersections file includes more than just road intersections. Pseudo intersections are where a road on an overpass or in a tunnel crosses another road, while statistical intersections are where roads cross a jurisdictional boundary.

In [326]:
intersections['CLASSIFI7'].unique()
Out[326]:
array(['Pseudo Intersection-Overpass/Underpass', 'Minor-Multi Level',
       'Lesser-Single Level', 'Minor-Single Level',
       'Expressway Interchange-Single Level', 'Cul de Sac-Single Level',
       'Pseudo Intersection-Single Level', 'Statistical-Single Level',
       'Lesser-Multi Level', 'Major-Single Level', 'Major-Multi Level',
       'Error No Domain Descrption'], dtype=object)

networkx requires explicit x and y coordinates for each node, so we'll construct those columns now.

In [327]:
intersections['X'] = intersections.geometry.x
intersections['Y'] = intersections.geometry.y
intersections
Out[327]:
INT_ID INTERSEC5 CLASSIFI6 CLASSIFI7 geometry X Y
0 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06
1 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06
2 13469952 Kipling Ave / Kipling S Gardiner C E Ramp MNRML Minor-Multi Level POINT (2004204.058 4438419.800) 2.004204e+06 4.438420e+06
3 13469952 Kipling Ave / Kipling S Gardiner C E Ramp MNRML Minor-Multi Level POINT (2004204.058 4438419.800) 2.004204e+06 4.438420e+06
4 13469946 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004205.136 4438431.466) 2.004205e+06 4.438431e+06
... ... ... ... ... ... ... ...
53730 13455527 Divadale Dr / Don Avon Dr MNRSL Minor-Single Level POINT (2019810.850 4445320.060) 2.019811e+06 4.445320e+06
53731 13455391 Broadway Ave / Don Avon Dr MNRSL Minor-Single Level POINT (2019809.598 4445420.784) 2.019810e+06 4.445421e+06
53732 13455352 Brentcliffe Rd / Broadway Ave / Rykert Cres MNRSL Minor-Single Level POINT (2019896.562 4445424.509) 2.019897e+06 4.445425e+06
53733 13455357 Brentcliffe Rd MNRSL Minor-Single Level POINT (2019982.356 4445395.287) 2.019982e+06 4.445395e+06
53734 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06

53735 rows × 7 columns

In [328]:
nodes = intersections[['INT_ID', 'X', 'Y']]

Feeding into networkx

We'll do a double inner join on FNODE and TNODE to make sure that every link we feed in has an associated intersection in the intersections file.

In [329]:
edges = centreline_valid.merge(nodes, left_on = 'FNODE', right_on = 'INT_ID', how = 'inner')[['FNODE', 'TNODE', 'GEO_ID', 'length_adj']]
edges = edges.merge(nodes, left_on = 'TNODE', right_on = 'INT_ID', how = 'inner')[['FNODE', 'TNODE', 'GEO_ID', 'length_adj']]
edges
Out[329]:
FNODE TNODE GEO_ID length_adj
0 13470579 13470553 914632 86.874711
1 13470540 13470553 7632579 131.404261
2 13470579 13470587 914644 164.727813
3 14301439 13470587 14301438 43.114767
4 30105416 13470587 30105484 34.139422
... ... ... ... ...
64162 13447087 13446984 6541762 122.229607
64163 20051187 13443677 20051186 39.50055
64164 20051187 13443677 20051186 39.50055
64165 13441591 30064465 30064466 104.908461
64166 13449808 13449874 440829 204.779724

64167 rows × 4 columns

In [330]:
edges = edges.drop_duplicates()

Similarly, since we already filtered out features such as railways and rivers, we'll do another pair of inner joins to keep only the nodes that connect to at least one link.

In [331]:
nodes = nodes.merge(edges, left_on = 'INT_ID', right_on = 'FNODE', how = 'inner')[['INT_ID', 'X', 'Y']]
nodes = nodes.merge(edges, left_on = 'INT_ID', right_on = 'TNODE', how = 'inner')[['INT_ID', 'X', 'Y']]
nodes
Out[331]:
INT_ID X Y
0 13469938 2.004208e+06 4.438469e+06
1 13469938 2.004208e+06 4.438469e+06
2 13469952 2.004204e+06 4.438420e+06
3 13469952 2.004204e+06 4.438420e+06
4 13469946 2.004205e+06 4.438431e+06
... ... ... ...
74896 13455352 2.019897e+06 4.445425e+06
74897 13455357 2.019982e+06 4.445395e+06
74898 13455357 2.019982e+06 4.445395e+06
74899 13455308 2.019992e+06 4.445424e+06
74900 13455308 2.019992e+06 4.445424e+06

74901 rows × 3 columns

In [332]:
nodes = nodes.drop_duplicates()
nodes
Out[332]:
INT_ID X Y
0 13469938 2.004208e+06 4.438469e+06
2 13469952 2.004204e+06 4.438420e+06
4 13469946 2.004205e+06 4.438431e+06
6 13469942 2.004206e+06 4.438449e+06
8 13469930 2.004210e+06 4.438489e+06
... ... ... ...
74883 13455527 2.019811e+06 4.445320e+06
74887 13455391 2.019810e+06 4.445421e+06
74891 13455352 2.019897e+06 4.445425e+06
74897 13455357 2.019982e+06 4.445395e+06
74899 13455308 2.019992e+06 4.445424e+06

33741 rows × 3 columns

In [333]:
edges
Out[333]:
FNODE TNODE GEO_ID length_adj
0 13470579 13470553 914632 86.874711
1 13470540 13470553 7632579 131.404261
2 13470579 13470587 914644 164.727813
3 14301439 13470587 14301438 43.114767
4 30105416 13470587 30105484 34.139422
... ... ... ... ...
64159 30021444 13444548 30021446 41.239712
64161 13447087 13446984 6541762 122.229607
64163 20051187 13443677 20051186 39.50055
64165 13441591 30064465 30064466 104.908461
64166 13449808 13449874 440829 204.779724

52176 rows × 4 columns

First we'll create the network, or the graph.

In [335]:
g = nx.Graph()

Then we'll feed in the links. The first two arguments are the start and end nodes, the length keyword carries the length_adj cost, and key is the unique identifier for the link (the GEO_ID that ties back to the centreline file).

In [336]:
for index, row in edges.iterrows():
    g.add_edge(row[0], row[1], length = row[3], key = row[2])

For the nodes, the first argument is the unique identifier, and pos holds the X and Y coordinates. We won't be using networkx to draw our network flows since that eats too much memory, so the position attribute is not as important.

In [337]:
for i, row in nodes.iterrows():
    g.add_node(row[0], pos = (row[1], row[2]))

Preparing the Trips

In [338]:
trips = pd.read_csv('trips_raw_data.csv')
/Users/Rick/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (11,23) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [339]:
trips['Start Time'] = pd.DatetimeIndex(trips['Start Time'])

We'll only be analyzing 2019, since it represents a normal year.

In [340]:
trips = trips[trips['Start Time'].dt.year == 2019]
In [341]:
trips
Out[341]:
Unnamed: 0 Trip Id Subscription Id Trip Duration Start Station Id Start Time Start Station Name End Station Id End Time End Station Name ... Temp (°C) Dew Point Temp (°C) Rel Hum (%) Wind Dir (10s deg) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Hmdx Wind Chill Weather
3232109 3232167 4581278 199751.0 1547 7021.0 2019-01-01 00:08:00-05:00 Bay St / Albert St 7233.0 2019-01-01 00:33:00-05:00 King / Cowan Ave - SMART ... 4.3 4.0 98.0 8.0 11.0 4.8 98.49 NaN NaN Rain,Fog
3232110 3232168 4581279 294730.0 1112 7160.0 2019-01-01 00:10:00-05:00 King St W / Tecumseth St 7051.0 2019-01-01 00:29:00-05:00 Mutual St / Shuter St ... 4.3 4.0 98.0 8.0 11.0 4.8 98.49 NaN NaN Rain,Fog
3232111 3232169 4581280 197252.0 589 7055.0 2019-01-01 00:15:00-05:00 Jarvis St / Carlton St 7013.0 2019-01-01 00:25:00-05:00 Scott St / The Esplanade ... 4.3 4.0 98.0 8.0 11.0 4.8 98.49 NaN NaN Rain,Fog
3232112 3232170 4581281 171700.0 259 7012.0 2019-01-01 00:16:00-05:00 Elizabeth St / Edward St (Bus Terminal) 7235.0 2019-01-01 00:20:00-05:00 Bay St / College St (West Side) - SMART ... 4.3 4.0 98.0 8.0 11.0 4.8 98.49 NaN NaN Rain,Fog
3232113 3232171 4581282 306314.0 281 7041.0 2019-01-01 00:19:00-05:00 Edward St / Yonge St 7257.0 2019-01-01 00:24:00-05:00 Dundas St W / St. Patrick St ... 4.3 4.0 98.0 8.0 11.0 4.8 98.49 NaN NaN Rain,Fog
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5570570 5570628 7334123 474257.0 523 7098.0 2019-12-31 23:39:00-05:00 Riverdale Park South (Broadview Ave) 7339.0 2019-12-31 23:48:00-05:00 Carlaw Ave / Strathcona Ave ... 0.2 -3.7 75.0 25.0 35.0 16.1 99.30 NaN NaN NaN
5570571 5570629 7334124 330999.0 273 7044.0 2019-12-31 23:45:00-05:00 Church St / Alexander St 7273.0 2019-12-31 23:49:00-05:00 Bay St / Charles St - SMART ... 0.2 -3.7 75.0 25.0 35.0 16.1 99.30 NaN NaN NaN
5570572 5570630 7334125 521717.0 1055 7100.0 2019-12-31 23:51:00-05:00 Dundas St E / Regent Park Blvd 7100.0 2020-01-01 00:08:00-05:00 Dundas St E / Regent Park Blvd ... 0.2 -3.7 75.0 25.0 35.0 16.1 99.30 NaN NaN NaN
5570573 5570631 7334126 312708.0 459 7470.0 2019-12-31 23:55:00-05:00 York St / Lake Shore Blvd W 7102.0 2020-01-01 00:03:00-05:00 Nelson St / Duncan St ... 0.2 -3.7 75.0 25.0 35.0 16.1 99.30 NaN NaN NaN
5570574 5570632 7334127 361530.0 361 7168.0 2019-12-31 23:58:00-05:00 Queens Quay / Yonge St 7033.0 2020-01-01 00:04:00-05:00 Union Station ... 0.2 -3.7 75.0 25.0 35.0 16.1 99.30 NaN NaN NaN

2338466 rows × 24 columns

Since our graph is based on nodes and links, we need to map each bikeshare station to an intersection. Let's load the bikeshare stations layer as a GeoDataFrame.

In [342]:
stations = pd.read_csv('bikeshare_stations.csv')
stations_gdf = gpd.GeoDataFrame(stations, geometry=gpd.points_from_xy(stations.lon, stations.lat))
stations_gdf.set_crs(epsg = '4326', inplace=True)
stations_gdf = stations_gdf.to_crs(epsg = '26917')
for index, row in stations_gdf.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    stations_gdf.at[index, 'geometry'] = rotated

We'll only be mapping the stations to nodes that are in our network, so we'll filter out nodes that aren't connected to any valid links.

In [343]:
intersections_valid = intersections[intersections['INT_ID'].isin(list(centreline_valid['FNODE'].unique()))]

Making the X and Y coordinates discrete columns.

In [344]:
stations_gdf['x'] = stations_gdf.geometry.x
stations_gdf['y'] = stations_gdf.geometry.y
In [345]:
stations_gdf
Out[345]:
Station Id Station Name lat lon capacity geometry x y
0 7000 Fort York Blvd / Capreol Ct 43.639832 -79.395954 35 POINT (2014946.157 4437923.978) 2.014946e+06 4.437924e+06
1 7001 Lower Jarvis St / The Esplanade 43.647830 -79.370698 15 POINT (2017148.903 4438220.892) 2.017149e+06 4.438221e+06
2 7002 St. George St / Bloor St W 43.667333 -79.399429 19 POINT (2015513.197 4440938.865) 2.015513e+06 4.440939e+06
3 7003 Madison Ave / Bloor St W 43.667158 -79.402761 15 POINT (2015249.463 4440993.758) 2.015249e+06 4.440994e+06
4 7004 University Ave / Elm St 43.656518 -79.389099 11 POINT (2015985.467 4439555.301) 2.015985e+06 4.439555e+06
... ... ... ... ... ... ... ... ...
605 7663 Kilgour Rd / Rumsey Rd 43.718039 -79.371914 17 POINT (2019187.990 4445749.433) 2.019188e+06 4.445749e+06
606 7664 Sunnybrook Health Center - L Wing 43.722680 -79.376440 15 POINT (2018978.287 4446345.068) 2.018978e+06 4.446345e+06
607 7665 Sunnybrook Health Centre - S Wing 43.720669 -79.377553 11 POINT (2018830.957 4446154.767) 2.018831e+06 4.446155e+06
608 7666 Dundas St W / St Helen Ave 43.650422 -79.440765 26 POINT (2011792.301 4440046.061) 2.011792e+06 4.440046e+06
609 7667 Spadina Ave / Sussex Ave 43.664733 -79.403028 19 POINT (2015154.969 4440740.603) 2.015155e+06 4.440741e+06

610 rows × 8 columns

The first step in mapping the stations to intersections is a cross join, where each row of one dataset is paired with every row of the other. To do this, we'll create a zero-valued key column in both datasets and join on it. (Newer versions of pandas can do this directly with merge(how='cross').)

In [346]:
intersections_valid['key'] = 0
stations_gdf['key'] = 0

cross_join = intersections_valid.merge(stations_gdf[['key', 'Station Id', 'x', 'y']], how='outer')
/Users/Rick/opt/anaconda3/lib/python3.7/site-packages/geopandas/geodataframe.py:1322: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super(GeoDataFrame, self).__setitem__(key, value)
In [347]:
cross_join
Out[347]:
INT_ID INTERSEC5 CLASSIFI6 CLASSIFI7 geometry X Y key Station Id x y
0 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06 0 7000 2.014946e+06 4.437924e+06
1 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06 0 7001 2.017149e+06 4.438221e+06
2 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06 0 7002 2.015513e+06 4.440939e+06
3 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06 0 7003 2.015249e+06 4.440994e+06
4 13469938 GARDINER EXPRESS / Kipling Ave SEUML Pseudo Intersection-Overpass/Underpass POINT (2004207.843 4438469.314) 2.004208e+06 4.438469e+06 0 7004 2.015985e+06 4.439555e+06
... ... ... ... ... ... ... ... ... ... ... ...
24154775 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06 0 7663 2.019188e+06 4.445749e+06
24154776 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06 0 7664 2.018978e+06 4.446345e+06
24154777 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06 0 7665 2.018831e+06 4.446155e+06
24154778 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06 0 7666 2.011792e+06 4.440046e+06
24154779 13455308 Brentcliffe Rd / Broadway Ave MNRSL Minor-Single Level POINT (2019992.486 4445424.112) 2.019992e+06 4.445424e+06 0 7667 2.015155e+06 4.440741e+06

24154780 rows × 11 columns

After cross joining the two datasets, we'll calculate the Euclidean distance between the stations and the intersections. Since this is a cross join, the distance is computed for every possible station-intersection combination.

Because we're working in a projected coordinate system over a relatively small area, we can treat the surface as planar and use the Pythagorean theorem. This is significantly faster than running a geodesic distance function over 24 million rows. We also skip the square root: squared distance preserves the ordering, which is all the nearest-match step needs.

In [348]:
cross_join['dist'] = (cross_join['X'] - cross_join['x'])**2 + (cross_join['Y'] - cross_join['y'])**2

To complete the mapping, we'll group by the station ID and take the lowest distance, then join back to the cross join to recover the matching INT_ID.

In [349]:
intersections_lookup = cross_join.groupby('Station Id').agg({'dist':'min'}).merge(cross_join, left_on = ['dist'], right_on = ['dist'], how = 'inner')
In [350]:
intersections_lookup = intersections_lookup.sort_values(by='dist').drop_duplicates('Station Id', keep = 'first').reset_index()[['INT_ID', 'Station Id']]
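As a sketch of an alternative, the groupby/merge/drop_duplicates dance can be replaced with groupby().idxmin(), which also sidesteps the edge case where two station-intersection pairs happen to share an identical distance (the IDs below are illustrative):

```python
import pandas as pd

# Toy cross join: two stations x two intersections
cross = pd.DataFrame({
    'Station Id': [7000, 7000, 7001, 7001],
    'INT_ID':     [13469938, 13469952, 13469938, 13469952],
    'dist':       [25.0, 4.0, 1.0, 16.0],
})

# idxmin returns the row label of the minimum distance per station
nearest = cross.loc[cross.groupby('Station Id')['dist'].idxmin(), ['INT_ID', 'Station Id']]
print(nearest['INT_ID'].tolist())  # [13469952, 13469938]
```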

Now we can do some joins to map each Station Id in the trips dataset to an INT_ID.

In [351]:
trips = trips[['Start Station Id', 'End Station Id']]
trips = trips.merge(intersections_lookup, left_on = 'Start Station Id', right_on = 'Station Id')
In [352]:
trips = trips.rename(columns = {'INT_ID': 'START INT'})
trips = trips[['Start Station Id', 'End Station Id', 'START INT']]
In [353]:
trips
Out[353]:
Start Station Id End Station Id START INT
0 7021.0 7233.0 13466250
1 7021.0 7253.0 13466250
2 7021.0 7323.0 13466250
3 7021.0 7052.0 13466250
4 7021.0 7036.0 13466250
... ... ... ...
2338461 7510.0 7226.0 30016301
2338462 7510.0 7411.0 30016301
2338463 7510.0 7043.0 30016301
2338464 7510.0 7411.0 30016301
2338465 7510.0 7243.0 30016301

2338466 rows × 3 columns

In [354]:
trips = trips.merge(intersections_lookup, left_on = 'End Station Id', right_on = 'Station Id')
trips = trips.rename(columns = {'INT_ID': 'END INT'})
trips = trips[['START INT', 'END INT']]

To avoid running networkx on 2 million trips, we'll route only the unique origin-destination pairs, then weight each resulting path by the number of trips made for that pair. We can do that by grouping by both START INT and END INT.

This brings us down to 112,705 replications that we need to run networkx for.

In [355]:
trip_od = trips.groupby(['START INT', 'END INT']).size().reset_index().rename(columns = {0: 'trips'})
In [356]:
trip_od
Out[356]:
START INT END INT trips
0 13455496 13455496 14
1 13455496 13455992 6
2 13455496 13456388 19
3 13455496 13456611 9
4 13455496 13456893 25
... ... ... ...
112700 30111086 30104384 2
112701 30111086 30105329 1
112702 30111086 30107059 55
112703 30111086 30109166 3
112704 30111086 30111086 23

112705 rows × 3 columns

In [357]:
trip_od.sort_values(by = 'trips', ascending = False)
Out[357]:
START INT END INT trips
73790 13467729 13467389 2043
110044 30102397 13468181 2001
98995 30021888 30021888 1787
79548 13468181 30102397 1678
50169 13466203 13466203 1555
... ... ... ...
31904 13464870 30098085 1
96370 30012003 13463195 1
68242 13467348 30008207 1
96367 30012003 13462831 1
75212 13467820 13464522 1

112705 rows × 3 columns

Running networkx to find the path for each OD pair

Some assumptions this process is making:

  • Topography doesn't matter. We didn't adjust link costs for grade or elevation changes, so a flat route and a hilly route of the same cost are treated as equivalent.
  • There are no constraints other than minimizing the cost. It's entirely possible some riders complete a tour of the city and deliberately avoid the shortest path, but this analysis won't take that into account. Similarly, if a rider makes a mid-route detour, it won't be captured.
  • Bikeshare users might be more conservative than regular cyclists in choosing their paths. Unfortunately, we couldn't afford to spend more time researching this behaviour.
In [358]:
trip_od['trips'].sum()
Out[358]:
2338466

Additionally, a small number of trips (capturing the "touring" behaviour) start and end at the same station. These trips will not be properly captured, since networkx's shortest path from a node to itself is trivial and contains no edges.

In [359]:
trip_od[trip_od['START INT'] == trip_od['END INT']]['trips'].sum()
Out[359]:
50203

As an example, let's route one of the most popular origin-destination pairs, from 30102397 to 13468181. This pair has 2,001 trips, from Queens Quay and Bathurst to Queens Quay and York.

The first step is to use the shortest path function, which applies Dijkstra's algorithm weighted by each edge's length cost attribute.

Then we can build a path graph from the returned node sequence; it contains only the nodes and links used in the path, which is how we extract the links from the graph.

In [360]:
route = []
most_popular_graph = nx.shortest_path(g, 30102397, 13468181, weight='length') # dijkstra

most_popular_route = nx.path_graph(most_popular_graph)  

for ea in most_popular_route.edges():
    route.append(g.edges[ea[0], ea[1]]['key'])
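The same edge extraction can be done without an intermediate path graph by walking consecutive node pairs with `nx.utils.pairwise`. A sketch on a toy graph, with attribute names mirroring the ones used above:

```python
import networkx as nx

# Toy graph standing in for the road network; 'length' and 'key' mirror
# the edge attributes used in the real graph g
g_demo = nx.Graph()
g_demo.add_edge('a', 'b', length=1, key='ab')
g_demo.add_edge('b', 'c', length=1, key='bc')
g_demo.add_edge('a', 'c', length=5, key='ac')

path = nx.shortest_path(g_demo, 'a', 'c', weight='length')  # ['a', 'b', 'c']

# pairwise() yields consecutive node pairs, so we can look up each edge's
# key directly from the node sequence
route = [g_demo.edges[u, v]['key'] for u, v in nx.utils.pairwise(path)]
assert route == ['ab', 'bc']
```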
In [361]:
centreline_unrotate = gpd.read_file('centreline.geojson')

Now we can grab the associated geometries from the centreline file to map the route. The contextily basemaps are served in web mercator, so we'll need to reproject to EPSG:3857.

In [362]:
route_poly = centreline_unrotate[centreline_unrotate['GEO_ID'].isin(route)].to_crs(epsg = '3857')
boundary = gpd.read_file('to_boundary.geojson')
In [363]:
plt.rcParams['figure.dpi'] = 450
In [364]:
import contextily as cx
to_basemap = cx.Place("Toronto, Canada")
In [365]:
fig, ax = plt.subplots()

ax.set_title('Queens Quay from Bathurst to York Street is the Busiest Origin-Destination Pair', fontsize = 10, fontweight = 'bold')

fig.set_size_inches(7,4)

route_poly.plot(ax = ax, linewidth = 4)
cx.add_basemap(ax, source=cx.providers.OpenStreetMap.Mapnik)
ax.set_axis_off()

#north arrow
x, y, arrow_length = 0.85, 0.4, 0.15
ax.annotate('N', xy=(x, y), xytext=(x, y- 0.20),
            arrowprops = dict(facecolor='black', width=1, headwidth=6),
            ha='center', va='center', fontsize=10,
            xycoords=ax.transAxes)


ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[365]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7fd5a4c24690>
In [366]:
trip_od = trip_od[trip_od['START INT'] != trip_od['END INT']]
trip_od
Out[366]:
START INT END INT trips
1 13455496 13455992 6
2 13455496 13456388 19
3 13455496 13456611 9
4 13455496 13456893 25
5 13455496 13457177 3
... ... ... ...
112699 30111086 30100054 12
112700 30111086 30104384 2
112701 30111086 30105329 1
112702 30111086 30107059 55
112703 30111086 30109166 3

112256 rows × 3 columns

Now that we know the process for a single origin-destination pair, we can operationalize it in a loop. On each iteration, once we extract the links, we'll append them to a list along with the number of trips made on that pair.

We're using a try/except statement because certain pairs cannot be routed and would otherwise break the loop. This way, an unroutable trip is logged and the loop continues.
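For reference, this is the failure mode the except clause absorbs: networkx raises `NetworkXNoPath` when the origin and destination sit in different components of the graph (and `NodeNotFound` if an intersection never made it into the graph at all). A minimal sketch:

```python
import networkx as nx

# Minimal reproduction of an unroutable pair: node 3 is disconnected
g_demo = nx.Graph()
g_demo.add_edge(1, 2)
g_demo.add_node(3)

try:
    nx.shortest_path(g_demo, 1, 3, weight='length')
except nx.NetworkXNoPath:
    result = 'unroutable'
assert result == 'unroutable'
```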

In [367]:
start = time.time()
route_table = []
unroutable = []
next_start = time.time()

for index, row in trip_od.iterrows():

    try:
        sp = nx.shortest_path(g, row['START INT'], row['END INT'], weight='length') # dijkstra algorithm

        pathGraph = nx.path_graph(sp) # assigning subgraph

        for ea in pathGraph.edges(): # appending edges in shortest path to list
            route_table.append([g.edges[ea[0], ea[1]]['key'], row['trips']])
    except (nx.NetworkXNoPath, nx.NodeNotFound): # unroutable pair or missing node; log and continue
        unroutable.append([row['START INT'], row['END INT'], row['trips']])
        
    if index%5000 == 0:
        print(index, time.time() - next_start)
        next_start = time.time()
        
print(time.time() - start)
5000 128.49723386764526
10000 94.51697301864624
15000 84.65551280975342
20000 86.0250608921051
25000 74.0934100151062
30000 70.85792303085327
35000 80.82517695426941
40000 62.64110016822815
45000 67.14286303520203
50000 56.84524703025818
55000 808.9101841449738
60000 72.0498399734497
65000 49.30156493186951
70000 40.46546983718872
75000 35.25971508026123
80000 38.99359583854675
85000 56.84396004676819
90000 75.8730080127716
100000 121.75331473350525
105000 75.83816814422607
110000 68.28335690498352
2283.8705089092255

After running for roughly 38 minutes, we successfully routed almost every pair. However, 12.7% of all trips could not find a path.

For this project, we didn't spend time investigating the unroutable trips, but from the results it seemed like only trips within Old Toronto were successfully routed.
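One quick diagnostic we didn't run would be to check the graph's connectivity: if the network splits into multiple components, every OD pair spanning two of them is unroutable by construction. A sketch on a toy graph:

```python
import networkx as nx

# Two disconnected pieces, standing in for a split road network
g_demo = nx.Graph()
g_demo.add_edges_from([(1, 2), (2, 3), (10, 11)])

# Counting components (and their sizes) is a quick first check:
# any OD pair spanning two components can never be routed
comps = sorted((len(c) for c in nx.connected_components(g_demo)), reverse=True)
assert comps == [3, 2]
```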

In [368]:
unroutable_df = pd.DataFrame.from_records(unroutable, columns = ['START INT', 'END INT', 'trips'])
round(unroutable_df['trips'].sum()/trip_od['trips'].sum()*100,3)
Out[368]:
12.692
In [369]:
pd.DataFrame.from_records(route_table).to_csv('routes.csv') # saving the file
In [370]:
centreline_counts = pd.DataFrame.from_records(route_table)
In [371]:
centreline_counts.columns = ['GEO_ID', 'volume']

To condense the table, we'll group by GEO_ID and sum all the trips occurring on each segment into a column we'll name volume. Then we'll merge it back onto the centreline layer to grab the other data we need.

In [372]:
centreline_counts = centreline_counts.groupby('GEO_ID').sum().sort_values(by = 'volume', ascending = False).reset_index()
In [373]:
centreline_counts = centreline.merge(centreline_counts, how = 'inner')

For a later chart, let's find the top 500 road segments for bikeshare users.

In [374]:
top_500 = centreline_counts.sort_values(by = 'volume', ascending = True).tail(500)
In [375]:
to_boundary = gpd.read_file('toronto.geojson')
to_boundary = to_boundary.to_crs(epsg = '26917')
for index, row in to_boundary.iterrows():
    rotated = shapely.affinity.rotate(row['geometry'], angle=-17, origin = Point(0, 0))
    to_boundary.at[index, 'geometry'] = rotated
In [376]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

cax = divider.append_axes("right", size="5%", pad=0.1) #to clean up the chloropleth legend

centreline_counts.plot(ax = ax, column = 'volume', cmap = 'inferno_r', legend = True, linewidth = 0.75, zorder = 2,
                     legend_kwds={'label': "Volume of Cyclists"}, cax = cax, vmin = -0, vmax = 120000)


ax.set_title('Predicted volumes of BikeShare trips in 2019, using dijkstra shortest path algorithm', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_axis_off()

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[376]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7fd5a76e1550>

This initial choropleth plot of the busyness of each segment is a little overplotted. Almost every road in Old Toronto has a trip, and it's very likely that many of these roads receive only a handful of trips across all of 2019.

In [377]:
ylim = ax.get_ylim() #saving the extents for a later map
xlim = ax.get_xlim()

Let's create a simple histogram to understand the distribution of our data.

In [378]:
fig, (ax1, ax2) = plt.subplots(1,2)
centreline_counts.hist('volume', bins = 250, ax = ax1)
centreline_counts.hist('volume', bins = 250, ax = ax2)
ax1.set_xlim([0,1000])
Out[378]:
(0, 1000)

This histogram confirms that picture. The right plot shows the full range, while the left shows only the 0 to 1,000 section. As a solution, let's plot only the top quartile of road segments.

In [379]:
centreline_counts['volume'].quantile([0.25,0.5,0.75])
Out[379]:
0.25     132.0
0.50     954.5
0.75    5107.0
Name: volume, dtype: float64
In [402]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

ax.set_facecolor('xkcd:sky blue')

cax = divider.append_axes("right", size="5%", pad=0.1) #to clean up the chloropleth legend

centreline[centreline['FCODE_DESC'].isin(['Expressway', 'Expressway Ramp', 'Major Arterial Ramp', 'Minor Arterial Ramp'])].plot(
    zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.5, alpha = 0.5)

centreline[centreline['FCODE_DESC'] == 'Major Arterial'].plot(zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.3, alpha = 0.5)
centreline[centreline['FCODE_DESC'] == 'Minor Arterial'].plot(zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.3, alpha = 0.3)



to_boundary.buffer(40).plot(zorder = 0, color = '#fffefc', ax = ax)
#to_boundary.plot(zorder = 0, color = 'w', ax = ax)

centreline_counts[centreline_counts['volume'] > 5107].plot(ax = ax, column = 'volume', cmap = 'inferno_r', 
                                                         legend = True, linewidth = 0.75, zorder = 2,
                     legend_kwds={'label': "Volume of Cyclists"}, cax = cax, vmin = 0, vmax = 120000)

plt.suptitle('Downtown and the Waterfront have the busiest roads for cyclists', fontsize = 12, fontweight = 'bold', y = 0.94)

ax.set_title('Predicted volumes of BikeShare trips in 2019, using dijkstra shortest path algorithm, top 25% of roads only', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_xticks([])
ax.set_yticks([])
#ax.set_axis_off()

ax.set_ylim(ylim)
ax.set_xlim(xlim)

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[402]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7fd58b15a1d0>

That's a lot better. From the map, we can see that key bike lanes, such as Sherbourne, the Martin Goodman Trail, Simcoe, Bloor, Beverley/St George, Brunswick, and Hoskin/Harbord/Wellesley dominate the counts. Overall, like with the trips by neighbourhood map, bikeshare trips are concentrated downtown, with some activity spreading out along the waterfront.

There are some interesting notes. Some roads with no bike lanes, such as The Esplanade and Victoria Street, have high ridership. Our cost adjustment favours residential roads, so these streets act as a bypass for users of Yonge and King, even though they have no bike infrastructure at all.

Another note is the section of the Martin Goodman Trail near Roncesvalles. Because we didn't consider elevation changes, the router sent cyclists over the bridge across the Gardiner, which doesn't match reality. The same issue likely explains why the Riverdale bridge over the DVP shows more activity than Dundas, Gerrard, or the Prince Edward Viaduct.

In [381]:
fig, ax = plt.subplots()
fig.set_size_inches(7,4)

divider = make_axes_locatable(ax) 

ax.set_facecolor('xkcd:sky blue')

cax = divider.append_axes("right", size="5%", pad=0.1)

centreline[centreline['FCODE_DESC'].isin(['Expressway', 'Expressway Ramp', 'Major Arterial Ramp', 'Minor Arterial Ramp'])].plot(
    zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.5, alpha = 0.5)

centreline[centreline['FCODE_DESC'] == 'Major Arterial'].plot(zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.3, alpha = 0.5)
centreline[centreline['FCODE_DESC'] == 'Minor Arterial'].plot(zorder = 1, color = 'xkcd:grey', ax = ax, linewidth = 0.3, alpha = 0.3)



to_boundary.buffer(40).plot(zorder = 0, color = '#fffefc', ax = ax)
#to_boundary.plot(zorder = 0, color = 'w', ax = ax)

top_500.plot(ax = ax, column = 'volume', cmap = 'inferno_r', 
                                                         legend = True, linewidth = 3, zorder = 2,
                     legend_kwds={'label': "Volume of Cyclists"}, cax = cax, vmin = 20000, vmax = 120000)


plt.suptitle('Simcoe, Martin Goodman Trail, Bloor, Sherbourne,\nWellesley, Beverley, and Esplanade are Cycling Hotspots', fontsize = 12, fontweight = 'bold', y = 1.02)

ax.set_title('Top 500 road segments for bikeshare trips, using dijkstra shortest path algorithm', fontsize = 8)

ax.set_yticklabels([])
ax.set_xticklabels([])
ax.set_xticks([])
ax.set_yticks([])
#ax.set_axis_off()

ax.set_ylim([4436000, 4442000])
ax.set_xlim([2011000,2021000])

ax.add_artist(ScaleBar(1, location = 'lower right', length_fraction = 0.25))# scale bar
Out[381]:
<matplotlib_scalebar.scalebar.ScaleBar at 0x7fd582f40050>

The top 500 road segments confirm the map above, but here we can clearly see the key corridors carrying the most bike traffic. An obvious opportunity for the city is another crossing under the Gardiner/rail corridor to relieve traffic on Simcoe (for example, extending the Bay Street bike lanes). Another north-south bike lane in east downtown may also be attractive, since Sherbourne and Victoria Streets are quite busy.

Anecdotally, the Queens Quay section of the Martin Goodman Trail has always been its slowest and busiest stretch, and this map shows quantitatively that it may be worth looking into another way to relieve bike traffic around the ferry terminal.

Analyzing the busiest corridors

We'll use another concept from transportation research: vehicle-kilometres travelled, or VKT for short. We can't simply sum the volumes of segments that lie on the same street, but VKT lets us account for segment length when grouping road segments into corridors.

VKT is simply the volume of cyclists on a segment multiplied by the length of that segment.
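A worked example with made-up numbers shows why VKT makes segments of different lengths comparable when summed into a corridor total:

```python
import pandas as pd

# Illustrative segments (volumes and lengths are made up; length in metres)
seg = pd.DataFrame({
    'LF_NAME': ['Bloor St W', 'Bloor St W', 'Bay St'],
    'volume':  [10000, 20000, 15000],
    'length':  [500, 250, 400],
})
seg['vkt'] = seg['volume'] * seg['length'] / 1000  # cyclist-kilometres per segment

# Summing VKT per road length-weights the segment volumes into a corridor total
corridor = seg.groupby('LF_NAME')['vkt'].sum()
assert corridor['Bloor St W'] == 10000.0   # 10000*0.5 + 20000*0.25
assert corridor['Bay St'] == 6000.0        # 15000*0.4
```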

In [382]:
centreline_counts['vkt'] = centreline_counts['volume']*centreline_counts['length']/1000

This allows us to sum the VKT grouped by road. Let's find the top 10 corridors by bikeshare VKT.

In [383]:
road_top10 = centreline_counts.groupby('LF_NAME').sum()[['vkt']].sort_values(ascending = False, by = 'vkt').head(10)
In [384]:
road_top10 = road_top10.reset_index()

We'll also want to map the bike infrastructure to each road.

In [385]:
bikeways_unique = bikeways.sort_values(by = 'STREET_NAME')[['STREET_NAME', 'INFRA_HIGHORDER']].drop_duplicates(keep = 'first')
In [386]:
road_top10 = road_top10.merge(bikeways_unique, left_on = 'LF_NAME', right_on = 'STREET_NAME', how = 'left')
In [387]:
road_top10
Out[387]:
LF_NAME vkt STREET_NAME INFRA_HIGHORDER
0 Martin Goodman Trl 530857.575617 Martin Goodman Trl MUT (2016 Network Plan/2012 Trails Plan)
1 Martin Goodman Trl 530857.575617 Martin Goodman Trl Multi-Use Trail
2 Sherbourne St 180834.041867 Sherbourne St Cycle Track
3 Bloor St W 144841.252692 Bloor St W Sharrows - Arterial
4 Bloor St W 144841.252692 Bloor St W Cycle Track
5 Lake Shore Blvd East Trl 122054.158288 NaN NaN
6 Gerrard St E 108276.610236 Gerrard St E Bike Lane
7 Gerrard St E 108276.610236 Gerrard St E Cycle Track
8 Bay St 106530.909307 Bay St Sharrows - Arterial
9 Bay St 106530.909307 Bay St Bike Lane
10 College St 89819.934335 College St Bike Lane
11 College St 89819.934335 College St Sharrows - Arterial
12 Lower Don River Trl 86950.221554 Lower Don River Trl Multi-Use Trail
13 Lower Don River Trl 86950.221554 Lower Don River Trl MUT (2016 Network Plan/2012 Trails Plan)
14 Victoria St 85219.877548 NaN NaN
15 The Esplanade 83934.034067 The Esplanade Signed Route (No Pavement Markings)

Some roads have multiple levels of bike infrastructure because different segments differ. We'll manually choose the most appropriate one here, but a more comprehensive analysis would ideally use an automated process.

Similarly, we'll need to manually clean up the INFRA_HIGHORDER column.

In [388]:
road_top10['INFRA_HIGHORDER'] = np.where(road_top10['INFRA_HIGHORDER'] == 'MUT (2016 Network Plan/2012 Trails Plan)',
                                          'Multi-Use Trail', road_top10['INFRA_HIGHORDER'])
road_top10['INFRA_HIGHORDER'] = np.where(road_top10['INFRA_HIGHORDER'] == 'Contraflow Cycle Track',
                                          'Cycle Track', road_top10['INFRA_HIGHORDER'])
road_top10['INFRA_HIGHORDER'] = np.where(road_top10['INFRA_HIGHORDER'] == 'Sharrows - Arterial',
                                          np.nan, road_top10['INFRA_HIGHORDER'])
road_top10['INFRA_HIGHORDER'] = np.where(road_top10['LF_NAME'] == 'Yonge St',
                                          np.nan, road_top10['INFRA_HIGHORDER'])
road_top10['INFRA_HIGHORDER'] = np.where(road_top10['LF_NAME'] == 'Lake Shore Blvd East Trl',
                                          'Multi-Use Trail', road_top10['INFRA_HIGHORDER'])
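The value-based recodes above can also be expressed as a single `Series.replace` mapping, which is tidier than a chain of `np.where` calls (a sketch; the row-conditional fixes keyed on LF_NAME would still need their own `np.where` steps):

```python
import numpy as np
import pandas as pd

s = pd.Series(['MUT (2016 Network Plan/2012 Trails Plan)',
               'Contraflow Cycle Track',
               'Sharrows - Arterial',
               'Cycle Track'])

# One replace() mapping covers all the value-based recodes in a single call
cleaned = s.replace({
    'MUT (2016 Network Plan/2012 Trails Plan)': 'Multi-Use Trail',
    'Contraflow Cycle Track': 'Cycle Track',
    'Sharrows - Arterial': np.nan,
})
assert cleaned.tolist()[:2] == ['Multi-Use Trail', 'Cycle Track']
assert pd.isna(cleaned[2]) and cleaned[3] == 'Cycle Track'
```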
In [389]:
road_top10 = road_top10.sort_values(by = 'INFRA_HIGHORDER').drop_duplicates(['LF_NAME', 'vkt'], keep = 'first'
                                                              ).sort_values(by = 'vkt', ascending = False)
In [390]:
road_top10['INFRA_HIGHORDER'] = road_top10['INFRA_HIGHORDER'].fillna('None') # so that we get a legend colour
road_top10 = road_top10.rename(columns = {'INFRA_HIGHORDER':'Bike Infrastructure'})
In [391]:
road_top10 
Out[391]:
LF_NAME vkt STREET_NAME Bike Infrastructure
0 Martin Goodman Trl 530857.575617 Martin Goodman Trl Multi-Use Trail
2 Sherbourne St 180834.041867 Sherbourne St Cycle Track
4 Bloor St W 144841.252692 Bloor St W Cycle Track
5 Lake Shore Blvd East Trl 122054.158288 NaN Multi-Use Trail
6 Gerrard St E 108276.610236 Gerrard St E Bike Lane
9 Bay St 106530.909307 Bay St Bike Lane
10 College St 89819.934335 College St Bike Lane
12 Lower Don River Trl 86950.221554 Lower Don River Trl Multi-Use Trail
14 Victoria St 85219.877548 NaN None
15 The Esplanade 83934.034067 The Esplanade Signed Route (No Pavement Markings)
In [405]:
centreline_counts['vkt'].sum()
Out[405]:
4731319.432600401
In [392]:
fig, ax = plt.subplots()

fig.set_size_inches(7,4)

sns.barplot(y = 'LF_NAME', x = 'vkt', data = road_top10, ax = ax, orient = 'h', hue = 'Bike Infrastructure', dodge = False)

ax.set_yticks(range(10))
ax.set_xticks(range(0,700000,100000))

ax.set_ylabel('Road/Trail')
ax.set_xlabel('Vehicle Kilometers Travelled')

ax.set_title('Bikeshare Traffic in 2019, by Road', fontsize = 10)
plt.suptitle('The Martin Goodman Trail is the busiest bike path', fontweight = 'bold', fontsize = 12)

ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))

ax.set_yticklabels(road_top10['LF_NAME'])

plt.show()

It's clear that the Martin Goodman Trail has high VKT and high traffic, and that's one of the main reasons why West End Toronto ranks second among the busiest boroughs for bikeshare riders. Carrying 11.2% of the city's total 4.7 million VKT (in other words, 11.2% of total bikeshare traffic), it is one of the most important corridors in the city for east-west travel.

Most of the other top corridors have some level of bike infrastructure, with only Victoria Street and The Esplanade lacking any (although The Esplanade was part of ActiveTO and is in the planning process to get permanent bike lanes).

It's unclear how much riders actually use Victoria Street as opposed to Yonge Street (due to the way we prioritized routes), but this does make the case for additional bike infrastructure in the city's yongeTOmorrow initiative.